Estimating composite functions 
by model selection 



Yannick Baraud 

Université de Nice Sophia-Antipolis 
Laboratoire J.-A. Dieudonné 

Lucien Birgé 

Université Paris VI 
Laboratoire de Probabilités et Modèles Aléatoires 
U.M.R. C.N.R.S. 7599 



February 9, 2011 



Abstract 

We consider the problem of estimating a function $s$ on $[-1,1]^k$ for large 
values of $k$ by looking for some best approximation by composite functions of 
the form $g \circ u$. Our solution is based on model selection and leads to a very 
general approach to solve this problem with respect to many different types of 
functions $g$, $u$ and statistical frameworks. In particular, we handle the problems of 
approximating $s$ by additive functions, single and multiple index models, neural 
networks, mixtures of Gaussian densities (when $s$ is a density) among other 
examples. We also investigate the situation where $s = g \circ u$ for functions $g$ and $u$ 
belonging to possibly anisotropic smoothness classes. In this case, our approach 
leads to a completely adaptive estimator with respect to the regularity of $s$. 

1 Introduction 



In various statistical problems, we have at hand a random mapping $X$ from a 
measurable space $(\Omega, \mathcal{A})$ to $(\mathbb{X}, \mathcal{X})$ with an unknown distribution $P_s$ on $\mathbb{X}$ 
depending on some parameter $s \in \mathcal{S}$ which is a function from $[-1,1]^k$ to $\mathbb{R}$. For 
instance, $s$ may be the density of an i.i.d. sample or the intensity of a Poisson 
process on $[-1,1]^k$ or a regression function. The statistical problem amounts to 
estimating $s$ by some estimator $\hat{s} = \hat{s}(X)$ the performance of which is measured by 
its quadratic risk, 

$$R(s, \hat{s}) = \mathbb{E}_s\left[d^2(s, \hat{s})\right],$$

where $d$ denotes a given distance on $\mathcal{S}$. To be more specific, we shall assume in 
this introduction that $X = (X_1, \ldots, X_n)$ is a sample of density $s^2$ (with $s \geq 0$) 
with respect to some measure $\mu$ and $d$ is the Hellinger distance. We recall that, 
given two probabilities $P$, $Q$ dominated by $\mu$ with respective densities $f = dP/d\mu$ and 
$g = dQ/d\mu$, the Hellinger distance $h$ between $P$ and $Q$ or, equivalently, between $f$ 
and $g$ (since it is independent of the choice of $\mu$) is given by 

$$h^2(P,Q) = h^2(f,g) = \frac{1}{2}\int \left(\sqrt{f} - \sqrt{g}\right)^2 d\mu. \qquad (1.1)$$
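As a small illustration of (1.1), the following sketch computes the Hellinger distance between two densities with respect to the counting measure on a three-point set and compares it with the Euclidean distance between their square roots. The densities `p` and `q` and the function names are hypothetical choices made for this example only.

```python
import math

def hellinger(p, q):
    """Hellinger distance h between two discrete densities, following (1.1):
    h^2 = (1/2) * sum (sqrt(p_i) - sqrt(q_i))^2."""
    h2 = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(h2)

def l2_distance(s, t):
    """Euclidean (L2 with respect to the counting measure) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

# Two hypothetical densities on a three-point set.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

# Parametrize each density by its square root, as in the text (density s^2).
s = [math.sqrt(a) for a in p]
t = [math.sqrt(b) for b in q]

h = hellinger(p, q)
l2 = l2_distance(s, t)
print(h, l2, math.sqrt(2.0) * h)
```

With this parametrization the printed values of `l2` and `sqrt(2) * h` coincide up to rounding, and `h` always lies in [0, 1].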

It follows that $\sqrt{2}\,d(s,t)$ is merely the $\mathbb{L}_2$-distance between $s$ and $t$. A general method 
for constructing estimators $\hat{s}$ is to choose a model $S$ for $s$, that is, do as if $s$ belonged 
to $S$, and to build $\hat{s}$ as an element of $S$. Sometimes the statistician really assumes 
that $s$ belongs to $S$ and that $S$ is the true parameter set, sometimes he does not 
and rather considers $S$ as an approximate model. This latter approach is somewhat 
more reasonable since it is in general impossible to be sure that $s$ does belong to $S$. 
Given $S$ and a suitable estimator $\hat{s}$, as those built in Birgé (2006) for example, one 
can achieve a risk bound of the form 



$$R(s, \hat{s}) \leq C\left[\inf_{t \in S} d^2(s,t) + \tau\,\mathcal{D}(S)\right], \qquad (1.2)$$



where $C$ is a universal constant (independent of $s$), $\mathcal{D}(S)$ the dimension of the model $S$ 
(with a proper definition of the dimension) and $\tau$, which is equal to $1/n$ in the specific 
context of density estimation, characterizes the amount of information provided by 
the observation $X$. 

It is well-known that many classical estimation procedures suffer from the so-called 
"curse of dimensionality", which means that the risk bound (1.2) deteriorates when 
$k$ increases and actually becomes very loose for even moderate values of $k$. This 
phenomenon is easy to explain and actually connected with the most classical way 
of choosing models for $s$. Typically, and although there is no way to check that such 
an assumption is true, one assumes that $s$ belongs to some smoothness class (Hölder, 
Sobolev or Besov) of index $\alpha$ and such an assumption can be translated in terms of 
approximation properties with respect to the target function $s$ of a suitable collection 
of linear spaces (generated by piecewise polynomials, splines, or wavelets for example). 
More precisely, there exists a collection $\mathbb{S}$ of models with the following property: for 
all $D \geq 1$, there exists a model $S \in \mathbb{S}$ with dimension $D$ which approximates $s$ with 
an error bounded by $cD^{-\alpha/k}$ for some $c$ independent of $D$ (but depending on $s$, $\alpha$ 
and $k$). With such a collection at hand, we deduce from (1.2) that whatever $D \geq 1$ 
one can choose a model $S = S(D) \in \mathbb{S}$ for which the estimator $\hat{s} \in S$ achieves a 
risk bounded from above by $C[c^2D^{-2\alpha/k} + \tau D]$. Besides, by using the elementary 
Lemma 1 below, to be proved in Section 6.5, one can optimize the choice of $D$, and 
hence of the model $S$ in $\mathbb{S}$, to build an estimator whose risk satisfies 

$$R(s,\hat{s}) \leq C\max\left\{3c^{2k/(2\alpha+k)}\tau^{2\alpha/(2\alpha+k)};\ 2\tau\right\}. \qquad (1.3)$$
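The trade-off behind this bound can be explored numerically. The sketch below minimizes $c^2D^{-2\alpha/k} + \tau D$ over the positive integers $D$ and compares the minimum with $\max\{3c^{2k/(2\alpha+k)}\tau^{2\alpha/(2\alpha+k)}; 2\tau\}$, the universal constant $C$ being dropped. The function names and the numerical values of $c$, $\alpha$, $\tau$ are arbitrary choices made for illustration.

```python
def best_dimension(c, alpha, k, tau, d_max=10**6):
    """Minimize c^2 * D^(-2*alpha/k) + tau * D over D = 1, ..., d_max.
    Returns the minimizing D and the minimal value."""
    theta = 2.0 * alpha / k
    best_d, best_val = 1, c * c + tau
    for d in range(2, d_max + 1):
        val = c * c * d ** (-theta) + tau * d
        if val < best_val:
            best_d, best_val = d, val
        elif tau * d > best_val:
            # The variance-like term alone exceeds the current minimum:
            # no larger D can improve it.
            break
    return best_d, best_val

def bound_1_3(c, alpha, k, tau):
    """max{3 c^(2k/(2a+k)) tau^(2a/(2a+k)); 2 tau}, constant C dropped."""
    rate = c ** (2.0 * k / (2 * alpha + k)) * tau ** (2.0 * alpha / (2 * alpha + k))
    return max(3.0 * rate, 2.0 * tau)

# Hypothetical values: smoothness alpha = 2, tau = 1e-4, c = 1.
for k in (1, 2, 5, 10):
    d_opt, value = best_dimension(1.0, 2.0, k, 1e-4)
    print(k, d_opt, value, bound_1_3(1.0, 2.0, k, 1e-4))
```

For these values the selected dimension grows with $k$ while the attained bound gets worse, which is the dimensional effect described in the text.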

Lemma 1 For all positive numbers $a$, $b$ and $\theta$ and $\mathbb{N}^*$ the set of positive integers, 

$$\inf_{D \in \mathbb{N}^*}\left\{aD^{-\theta} + bD\right\} \leq b + \min\left\{2a^{1/(\theta+1)}b^{\theta/(\theta+1)};\ a\right\} \leq \max\left\{3a^{1/(\theta+1)}b^{\theta/(\theta+1)};\ 2b\right\}.$$

Since the risk bound (1.3) is achieved for $D$ of order $\tau^{-k/(2\alpha+k)}$ as $\tau$ tends to 0, 
the deterioration of the rate $\tau^{2\alpha/(2\alpha+k)}$ when $k$ increases comes from the fact that 
we use models of larger dimension to approximate $s$ when $k$ is large. Nevertheless, 
this phenomenon is only due to the previous approach based on smoothness assump- 
tions for s. An alternative approach, assuming that s can be closely approximated 
by suitable parametric models the dimensions of which do not depend on k would 
not suffer from the same weaknesses. More generally, a structural assumption on s 
associated to a collection of models S', the approximation properties of which improve 
on those of S, can only lead to a better risk bound and it is not clear at all that as- 
suming that s belongs to a smoothness class is more realistic than directly assuming 
approximation bounds with respect to the models of S'. Such structural assumptions 
that would amount to replacing the large models involved in the approximation of 
smooth functions by simpler ones have been used for many years, especially in the 
context of regression. Examples of such structural assumptions are provided by 
additive models, the single index model, models based on radial approximation as in 
Donoho and Johnstone (1989), the projection pursuit algorithm introduced by Friedman 
and Tukey (an overview of the procedure is available in Huber (1985)), 
and artificial neural networks as in Barron (1993; 1994), among other examples. 

In any case, an unattractive feature of the previous approach based on an a priori 
choice of a model $S \in \mathbb{S}$ is that it requires knowing suitable upper bounds on the 
distances between $s$ and the models $S$ in $\mathbb{S}$. Such a requirement is much too strong 
and an essential improvement can be brought by the modern theory of model selection. 
More precisely, given some prior probability $\pi$ on $\mathbb{S}$, model selection allows one to build 
an estimator $\hat{s}$ with a risk bound 

$$R(s,\hat{s}) \leq C \inf_{S \in \mathbb{S}}\left\{\inf_{t \in S} d^2(s,t) + \tau\left[\mathcal{D}(S) + \log(1/\pi(S))\right]\right\}, \qquad (1.4)$$

for some universal constant $C$. If we neglect the influence of $\log(1/\pi(S))$, which 
is connected to the complexity of the family $\mathbb{S}$ of models we use, the comparison 
between (1.2) and (1.4) indicates that the method approximately selects among $\mathbb{S}$ a 
model leading to the smallest risk bound. 

With such a tool at hand, that allows us to play with many models simultaneously 
and let the estimator choose a suitable one, we may freely introduce various models 
corresponding to various sorts of structural assumptions on s that avoid the "curse 
of dimensionality". We can, moreover, mix them with models that are based on 
pure smoothness assumptions that do suffer from this dimensional effect or even with 
simple parametric models. 

The main purpose of this paper is to provide a method for building various sorts 
of models that may be used, in conjunction with other ones, to approximate 
functions on $[-1,1]^k$ for large values of $k$. The idea, which is not new, is to approximate 
the unknown $s$ by a composite function $g \circ u$ where $g$ and $u$ have different 
approximation properties. Recent works in this direction can be found in Horowitz and 
Mammen (2007) or Juditsky, Lepski and Tsybakov (2009). Actually, our initial 
motivation for this research was a series of lectures given at CIRM in 2005 by Oleg Lepski 
about a former version of this last paper. There are, nevertheless, major differences 
between their approach and ours. They deal with estimation in the white noise model, 
kernel methods and the $\mathbb{L}_\infty$-loss. They also assume that the true unknown density 
$s$ to be estimated can be written as $s = g \circ u$ where $g$ and $u$ have given smoothness 
properties and use these properties to build a kernel estimator which is better than 
those based on the overall smoothness of $s$. The use of the $\mathbb{L}_\infty$-loss indeed involves 
additional difficulties and the minimax rates of convergence happen to be 
substantially slower (not only by logarithmic terms) than the rates one gets for the $\mathbb{L}_2$-loss, 
as the authors mention on page 1369, comparing their results with those of Horowitz 
and Mammen (2007). 

Our approach is radically different from the one of Juditsky, Lepski and Tsybakov 
and considerably more general as we shall see, but this level of generality has a price. 
While they provide a constructive estimator that can be computed in a reasonable 
amount of time, although based on supposedly known smoothness properties of g and 
u, we offer a general but abstract method that applies to many situations but does 
not provide practical estimators, only abstract ones. As a consequence, our results 
about the performance of these estimators are of a theoretical nature, to serve as 
benchmarks about what can be expected from good estimators in various situations. 

We actually consider "curve estimation" with an unknown functional parameter 
$s$ and measure the loss by $\mathbb{L}_2$-type distances. Our construction applies to various 
statistical frameworks (not only the Gaussian White Noise but all those for which a 
suitable model selection theorem is available). We also do not assume that $s = g \circ u$ 
but rather approximate $s$ by functions of the form $g \circ u$ and do not fix in advance 
the smoothness properties of $g$ and $u$ but rather let our estimator adapt to them. This 
approach leads to a completely adaptive method with many different possibilities to 
approximate $s$. It allows one, in particular, to play with the smoothness properties of $g$ 
and $u$ or to mix purely parametric models with others based on smooth functions. 
Since methods and theorems about model selection are already available, our main 
task here will be to build suitable models for various forms of composite functions 
g o u and check that they do satisfy the assumptions required for applying previous 
model selection results. 

2 Our statistical framework 

We observe a random element $X$ from the probability space $(\Omega, \mathcal{A}, \mathbb{P}_s)$ to $(\mathbb{X}, \mathcal{X})$ with 
distribution $P_s$ on $\mathbb{X}$ depending on an unknown parameter $s$. The set $\mathcal{S}$ of possible 
values of $s$ is a subset of some space $\mathbb{L}_q(E,\mu)$ where $\mu$ is a given probability on the 
measurable space $(E,\mathcal{E})$. We shall mainly consider the case $q = 2$ even though one 
can also take $q = 1$ in the context of density estimation. We denote by $d$ the distance 
on $\mathbb{L}_q(E,\mu)$ corresponding to the $\mathbb{L}_q(E,\mu)$-norm $\|\cdot\|_q$ (omitting the dependency of $d$ 
with respect to $q$) and by $\mathbb{E}_s$ the expectation with respect to $\mathbb{P}_s$ so that the quadratic 
risk of an estimator $\hat{s}$ is $\mathbb{E}_s[d^2(s,\hat{s})]$. The main objective of this paper, in order to 
estimate $s$ by model selection, is to build special models $S$ that consist of functions of 
the form $f \circ t$ where $t = (t_1, \ldots, t_l)$ is a mapping from $E$ to $I \subset \mathbb{R}^l$, $f$ is a continuous 
function on $I$ and $I = \prod_{j=1}^{l} I_j$ is a product of compact intervals of $\mathbb{R}$. Without loss 
of generality, we may assume that $I = [-1,1]^l$. Indeed, if $l = 1$, $t$ takes its values in 
$I_1 = [\beta - \alpha, \beta + \alpha]$, $\alpha > 0$, and $f$ is defined on $I_1$, we can replace the pair $(f,t)$ by 
$(\bar{f}, \bar{t})$ where $\bar{t}(x) = \alpha^{-1}[t(x) - \beta]$ and $\bar{f}(y) = f(\alpha y + \beta)$ so that $\bar{t}$ takes its values in 
$[-1,1]$ and $f \circ t = \bar{f} \circ \bar{t}$. The argument easily extends to the multidimensional case. 

2.1 Notations and conventions 

To perform our construction based on composite functions $f \circ t$, we introduce the 
following spaces of functions: $\mathcal{T} \subset \mathbb{L}_q(E,\mu)$ is the set of measurable mappings from 
$E$ to $[-1,1]$, $\mathcal{F}_{l,\infty}$ is the set of bounded functions on $[-1,1]^l$ endowed with the distance 
$d_\infty$ given by $d_\infty(f,g) = \sup_{x \in [-1,1]^l} |f(x) - g(x)|$ and $\mathcal{F}_{l,c}$ is the subset of $\mathcal{F}_{l,\infty}$ which 
consists of continuous functions on $[-1,1]^l$. We denote by $\mathbb{N}^*$ (respectively, $\mathbb{R}_+^*$) the 
set of positive integers (respectively positive numbers) and set 

$$\lfloor z \rfloor = \sup\{j \in \mathbb{Z} \mid j \leq z\} \quad\text{and}\quad \lceil z \rceil = \inf\{j \in \mathbb{N}^* \mid j \geq z\}, \quad\text{for all } z \in \mathbb{R}.$$

The numbers $x \wedge y$ and $x \vee y$ stand for $\min\{x,y\}$ and $\max\{x,y\}$ respectively and 
$\log_+(x)$ stands for $(\log x) \vee 0$. The cardinality of a set $A$ is denoted by $|A|$ and, by 
convention, "countable" means "finite or countable". We call subprobability on some 
countable set $A$ any positive measure $\pi$ on $A$ with $\pi(A) \leq 1$ and, given $\pi$ and $a \in A$, 
we set $\pi(a) = \pi(\{a\})$ and $\Delta_\pi(a) = -\log(\pi(a))$ with the convention $\Delta_\pi(a) = +\infty$ 
if $\pi(a) = 0$. The dimension of the linear space $V$ is denoted by $\mathcal{D}(V)$. Given a 
compact subset $K$ of $\mathbb{R}^k$ with nonempty interior, we define the Lebesgue probability $\mu$ on $K$ by 
$\mu(A) = \lambda(A)/\lambda(K)$ for $A \subset K$, where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^k$. 

For $x \in \mathbb{R}^m$, $x_j$ denotes the $j$th coordinate of $x$ ($1 \leq j \leq m$) and, similarly, 
$x_{i,j}$ denotes the $j$th coordinate of $x_i$ if the vectors $x_i$ are already indexed. We set 
$|x|^2 = \sum_{j=1}^{m} x_j^2$ for the squared Euclidean norm of $x \in \mathbb{R}^m$, without reference to the 
dimension $m$, and denote by $\mathbb{B}_m$ the corresponding unit ball in $\mathbb{R}^m$ and by $\mathbf{1}$ the vector 
with unit coordinates in $\mathbb{R}^m$. Similarly, $|x|_\infty = \max\{|x_1|, \ldots, |x_m|\}$ for all $x \in \mathbb{R}^m$. 
For $x$ in some metric space $(M,d)$ and $r > 0$, $\mathbb{B}(x,r)$ denotes the closed ball of center 
$x$ and radius $r$ in $M$ and for $A \subset M$, $d(x,A) = \inf_{y \in A} d(x,y)$. Finally, $C$ stands 
for a universal constant while $C'$ is a constant that depends on some parameters of 
the problem. We may make this dependence explicit by writing $C'(a,b)$ for instance. 
Both $C$ and $C'$ are generic notations for constants that may change from line to line. 
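Some of these conventions differ slightly from the usual ones; in particular $\lceil z \rceil$ is never smaller than 1 because the infimum runs over positive integers. The following sketch transcribes them directly; the function names are hypothetical and are not used elsewhere in the paper.

```python
import math

def floor_z(z):
    """sup{j in Z | j <= z}: the usual integer part."""
    return math.floor(z)

def ceil_star(z):
    """inf{j in N* | j >= z}: the usual ceiling, but never smaller than 1."""
    return max(1, math.ceil(z))

def log_plus(x):
    """(log x) v 0, for x > 0."""
    return max(math.log(x), 0.0)

def delta(pi_a):
    """Delta_pi(a) = -log(pi(a)), with the convention +infinity if pi(a) = 0."""
    return math.inf if pi_a == 0 else -math.log(pi_a)

print(floor_z(-2.5), ceil_star(-2.5), ceil_star(2.1), log_plus(0.5), delta(0.0))
```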

2.2 A general model selection result 

General model selection results apply to models which possess a finite dimension in 
a suitable sense. Throughout the paper, we assume that in the statistical framework 
we consider the following theorem holds. 

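Theorem 1 Let $\mathbb{S}$ be a countable family of finite dimensional linear subspaces $S$ of 
$\mathbb{L}_q(E,\mu)$ and let $\pi$ be some subprobability measure on $\mathbb{S}$. There exists an estimator 
$\hat{s} = \hat{s}(X)$ with values in $\bigcup_{S \in \mathbb{S}} S$ satisfying, for all $s \in \mathcal{S}$, 

$$\mathbb{E}_s\left[d^2(s,\hat{s})\right] \leq C \inf_{S \in \mathbb{S}}\left\{d^2(s,S) + \tau\left[(\mathcal{D}(S) \vee 1) + \Delta_\pi(S)\right]\right\}, \qquad (2.1)$$

where the positive constant $C$ and parameter $\tau$ only depend on the specific statistical 
framework at hand. 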
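To get a feeling for the right-hand side of (2.1), one may evaluate it (without the constant $C$) on a toy collection of models. In the sketch below, each model is described by a hypothetical triple (squared approximation error, dimension, prior weight); the values $d^2(s,S_D) = D^{-2}$, $\pi(S_D) = e^{-D}$ and $\tau = 10^{-3}$ are arbitrary assumptions made for illustration and do not come from the paper.

```python
import math

def risk_bound_rhs(models, tau):
    """Infimum over the models of d^2(s,S) + tau * [(D(S) v 1) + Delta_pi(S)].
    Each model is a triple (squared_distance, dimension, prior_weight > 0)."""
    best = math.inf
    for dist_sq, dim, weight in models:
        penalty = tau * (max(dim, 1) + (-math.log(weight)))
        best = min(best, dist_sq + penalty)
    return best

# Hypothetical collection: one model per dimension D = 1, ..., 50,
# with weights exp(-D), whose sum is smaller than 1 (a subprobability).
models = [(D ** -2.0, D, math.exp(-D)) for D in range(1, 51)]
value = risk_bound_rhs(models, tau=1e-3)
print(value)
```

In this toy case the infimum balances the approximation term $D^{-2}$ against the penalty $2\tau D$ and is attained at $D = 10$.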

Similar results often hold also for the loss function $d^r(s,\hat{s})$ ($r \geq 1$) replacing $d^2(s,\hat{s})$. 
In such a case, the results we prove below for the quadratic risk easily extend to the 
risk $\mathbb{E}_s[d^r(s,\hat{s})]$. For simplicity, we shall only focus on the case $r = 2$. 

2.3 Some illustrations 



The previous theorem actually holds for various statistical frameworks. Let us provide 
a partial list. 

Gaussian frameworks. A prototype for Gaussian frameworks is provided by some 
Gaussian isonormal linear process as described in Section 2 of Birgé and Massart (2001). 
In such a case, $X$ is a Gaussian linear process with a known variance $\tau$, indexed by 
a subset $\mathcal{S}$ of some Hilbert space $\mathbb{L}_2(E,\mu)$. This means that $s \in \mathcal{S}$ determines the 
distribution $P_s$. Regression with Gaussian errors and Gaussian sequences can both 
be seen as particular cases of this framework. Then Theorem 1 is a consequence 
of Theorem 2 of Birgé and Massart (2001). In the regression setting, the practical 
case of an unknown variance has been considered in Baraud, Giraud and Huet (2009) 
under the additional assumption that $\mathcal{D}(S) \vee \Delta_\pi(S) \leq n/2$ for all $S \in \mathbb{S}$. 

Density estimation. Here $X = (X_1, \ldots, X_n)$ is an $n$-sample with density $s^2$ with 
respect to $\mu$ and $\mathcal{S}$ is the set of nonnegative elements of norm 1 in $\mathbb{L}_2(E,\mu)$. Then 
$d(s,t) = \sqrt{2}\,h(s^2,t^2)$ where $h$ denotes the Hellinger distance between densities, $\tau = n^{-1}$ 
and Theorem 1 follows from Theorem 6 of Birgé (2006). Alternatively, one can 
take for $s$ the density itself, for $\mathcal{S}$ the set of nonnegative elements of norm 1 in $\mathbb{L}_1(E,\mu)$ 
and set $q = 1$. The result then follows from Theorem 8 of Birgé (2006). Under the 
additional assumption that $s \in \mathbb{L}_2(E,\mu) \cap \mathbb{L}_\infty(E,\mu)$, the case $q = 2$ follows from 
Theorem 6 of Birgé (2008) with $\tau = \|s\|_\infty (1 \vee \log \ldots$ 

Regression with fixed design. We observe $X = \{(x_1,Y_1), \ldots, (x_n,Y_n)\}$ with 
$\mathbb{E}[Y_i] = s(x_i)$ where $s$ is a function from $E = \{x_1, \ldots, x_n\}$ to $\mathbb{R}$ and the errors 
$\varepsilon_i = Y_i - s(x_i)$ are i.i.d. Here $\mu$ is the uniform distribution on $E$, hence $d^2(s,t) = 
n^{-1}\sum_{i=1}^{n}[s(x_i) - t(x_i)]^2$ and $\tau = 1/n$. When the errors $\varepsilon_i$ are subgaussian, Theorem 1 
follows from Theorem 3.1 in Baraud, Comte and Viennet (2001). For more 
heavy-tailed distributions (Laplace, Cauchy, etc.) we refer to Theorem 6 of Baraud (2011) 
when $s$ takes its values in $[-1,1]$. 

Bounded regression with random design. Let $(X,Y)$ be a pair of random 
variables with values in $E \times [-1,1]$ where $X$ has distribution $\mu$ and $\mathbb{E}[Y \mid X = x] = s(x)$ 
is a function from $E$ to $[-1,1]$. Our aim here is to estimate $s$ from the observation 
of $n$ independent copies $X = \{(X_1,Y_1), \ldots, (X_n,Y_n)\}$ of $(X,Y)$. Here the distance 
$d$ corresponds to the $\mathbb{L}_2(E,\mu)$-distance and Theorem 1 follows from Corollary 8 in 
Birgé (2006) with $\tau = n^{-1}$. 

Poisson processes. In this case, $X$ is a Poisson process on $E$ with mean measure 
$s^2 \cdot \mu$, where $s$ is a nonnegative element of $\mathbb{L}_2(E,\mu)$. Then $\tau = 1$ and Theorem 1 
follows from Birgé (2007). 



3 Approximation of functions 

In this section, we give a brief overview of more or less classical collections of models 
commonly used for approximating smooth (and less smooth) functions on $[-1,1]^k$. 
We shall use these collections to approximate composite functions of the form $g \circ u$ 
by approximating $g$ and $u$ separately. Finally, we shall explain the basic ideas of our 
approach at the end of this section. 

Collections of models with the property below will be of special interest throughout 
this paper. 

Assumption 1 The elements of the collection $\mathbb{S}$ are finite dimensional linear spaces 
and for each $D \in \mathbb{N}$ the number of elements of $\mathbb{S}$ with dimension $D$ is bounded by 
$\exp[c(\mathbb{S})(D + 1)]$ for some nonnegative constant $c(\mathbb{S})$ depending on $\mathbb{S}$ only. 

3.1 Classical models for approximating functions 

Along this section, $d$ denotes the $\mathbb{L}_2$-distance in $\mathbb{L}_2([-1,1]^k, 2^{-k}dx)$, thus taking $q = 2$, 
$E = [-1,1]^k$ and $\mu$ the Lebesgue probability on $E$. 

3.1.1 Approximating smooth functions on $[-1,1]^k$ 

When $k = 1$, a typical smoothness condition for a function $s$ on $[-1,1]$ is that it 
belongs to some Hölder space $\mathcal{H}^\alpha([-1,1])$ with $\alpha = r + \alpha'$, $r \in \mathbb{N}$ and $0 < \alpha' \leq 1$, 
which is the set of all functions $f$ on $[-1,1]$ with a continuous derivative of order $r$ 
satisfying, for some constant $L(f) > 0$, 

$$\left|f^{(r)}(x) - f^{(r)}(y)\right| \leq L(f)\,|x - y|^{\alpha'} \quad\text{for all } x, y \in [-1,1].$$

This notion of smoothness extends to functions $f(x_1, \ldots, x_k)$ defined on $[-1,1]^k$, by 
saying that $f$ belongs to $\mathcal{H}^{\alpha}([-1,1]^k)$ with $\alpha = (\alpha_1, \ldots, \alpha_k) \in (0,+\infty)^k$ if, viewed as 
a function of $x_i$ only, it belongs to $\mathcal{H}^{\alpha_i}([-1,1])$ for $1 \leq i \leq k$ with some constant $L(f)$ 
independent of both $i$ and the variables $x_j$ for $j \neq i$. The smoothness of a function $s$ 
in $\mathcal{H}^{\alpha}([-1,1]^k)$ is said to be isotropic if the $\alpha_i$ are all equal and anisotropic otherwise, 
in which case the quantity $\bar{\alpha}$ given by 

$$\frac{1}{\bar{\alpha}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{\alpha_i}$$

corresponds to the average smoothness of $s$. It follows from results in Approximation 
Theory that functions in the Hölder space $\mathcal{H}^{\alpha}([-1,1]^k)$ can be well approximated 
by piecewise polynomials on $k$-dimensional hyperrectangles. More precisely, our next 
proposition follows from results in Dahmen, DeVore and Scherer (1980). 
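The average smoothness $\bar{\alpha}$ is the harmonic mean of the $\alpha_i$, so it is pulled towards the least regular direction. The sketch below computes it together with the order of magnitude $D^{-\bar{\alpha}/k}$ of the approximation error obtained with models of dimension of order $D$ (constants dropped); the function names and the smoothness vectors are hypothetical choices for illustration.

```python
def average_smoothness(alpha):
    """Harmonic mean of the directional smoothness indices alpha_1, ..., alpha_k."""
    k = len(alpha)
    return k / sum(1.0 / a for a in alpha)

def approximation_rate(alpha, D):
    """Order of magnitude D^(-abar/k) of the approximation error, constants dropped."""
    return D ** (-average_smoothness(alpha) / len(alpha))

isotropic = [2.0, 2.0, 2.0]
anisotropic = [0.5, 2.0, 3.5]   # same arithmetic mean as the isotropic case
print(average_smoothness(isotropic), average_smoothness(anisotropic))
print(approximation_rate(isotropic, 1000), approximation_rate(anisotropic, 1000))
```

Although both vectors have the same arithmetic mean, the anisotropic one has a smaller $\bar{\alpha}$ and therefore a slower approximation rate.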



Proposition 1 Let $(k,r) \in \mathbb{N}^* \times \mathbb{N}$. There exists a collection of models $\mathbb{H}_{k,r} = 
\bigcup_{D \geq 1} \mathbb{H}_{k,r}(D)$ satisfying Assumption 1 such that for each positive integer $D$, the 
family $\mathbb{H}_{k,r}(D)$ consists of linear spaces $S$ with dimensions $\mathcal{D}(S) \leq C_1'(k,r)D$ spanned 
by piecewise polynomials of degree at most $r$ on $k$-dimensional hyperrectangles and 
for which 

$$\inf_{S \in \mathbb{H}_{k,r}(D)} d(s,S) \leq \inf_{S \in \mathbb{H}_{k,r}(D)} d_\infty(s,S) \leq C_2'(k,r)\,L(s)\,D^{-\bar{\alpha}/k}, \qquad (3.1)$$

for all $s \in \mathcal{H}^{\alpha}([-1,1]^k)$ with $\sup_{1 \leq i \leq k} \alpha_i \leq r + 1$. 

3.1.2 Approximating functions in anisotropic Besov spaces 

Anisotropic Besov spaces generalize anisotropic Hölder spaces and are defined in a 
similar way by using directional moduli of smoothness, just as Hölder spaces are 
defined using directional derivatives. To be short, a function belongs to an anisotropic 
Besov space on $[-1,1]^k$ if, when all coordinates are fixed apart from one, it belongs 
to a Besov space on $[-1,1]$. A precise definition (restricted to $k = 2$ but which 
can be generalized easily) can be found in Hochmuth (2002). The general 
definition together with useful approximation properties by piecewise polynomials can be 
found in Akakpo (2009). For $0 < p \leq +\infty$, $k \geq 1$ and $\beta \in (0,+\infty)^k$, let us 
denote by $B^{\beta}_{p,p}([-1,1]^k)$ the anisotropic Besov spaces. In particular, $B^{\beta}_{\infty,\infty}([-1,1]^k) = 
\mathcal{H}^{\beta}([-1,1]^k)$. It follows from Akakpo (2009) that Proposition 1 can be generalized 
to Besov spaces in the following way. 

Proposition 2 Let $p > 0$, $k \in \mathbb{N}^*$ and $r \in \mathbb{N}$. There exists a collection of models 
$\mathbb{B}_{k,r} = \bigcup_{D \geq 1} \mathbb{B}_{k,r}(D)$ satisfying Assumption 1 such that for each positive integer $D$, 
$\mathbb{B}_{k,r}(D)$ consists of linear spaces $S$ with dimensions $\mathcal{D}(S) \leq C_1'(k,r)D$ spanned by 
piecewise polynomials of degree at most $r$ on $k$-dimensional hyperrectangles and for 
which 

$$\inf_{S \in \mathbb{B}_{k,r}(D)} d(s,S) \leq C_3'(k,r,p)\,|s|_{\beta,p,p}\,D^{-\bar{\beta}/k} \qquad (3.2)$$

for all $s \in B^{\beta}_{p,p}([-1,1]^k)$ with semi-norm $|s|_{\beta,p,p}$ and $\beta$ satisfying 

$$\sup_{1 \leq i \leq k} \beta_i < r + 1 \quad\text{and}\quad \bar{\beta} > k\left[(p^{-1} - 2^{-1}) \vee 0\right]. \qquad (3.3)$$


3.2 Approximating composite functions 
3.2.1 Preliminary approximation results 

As already mentioned, we assume along the paper that $s$ is either of the form $g \circ u$ or 
can be well approximated by a function of this form, where $u = (u_1, \ldots, u_l)$ belongs 
to $\mathcal{T}^l$ and $g$ to $\mathcal{F}_{l,c}$. The purpose of this section is to see how well $f \circ t$ approximates 
$g \circ u$ when we know how well $f$ approximates $g$ and $t = (t_1, \ldots, t_l)$ approximates $u$. 
We start with the definition of the modulus of continuity of a function $g$ in $\mathcal{F}_{l,c}$. 

Definition 1 We say that $w$ from $[0,2]^l$ to $\mathbb{R}_+^l$ is a modulus of continuity for a 
continuous function $g$ on $[-1,1]^l$ if for all $z \in [0,2]^l$, $w(z)$ is of the form $w(z) = 
(w_1(z_1), \ldots, w_l(z_l))$ where each function $w_j$ with $j = 1, \ldots, l$ is continuous, 
nondecreasing and concave from $[0,2]$ to $\mathbb{R}_+$, satisfies $w_j(0) = 0$, and 

$$|g(x) - g(y)| \leq \sum_{j=1}^{l} w_j(|x_j - y_j|) \quad\text{for all } x, y \in [-1,1]^l.$$

For $\alpha \in (0,1]^l$ and $\mathbf{L} \in (0,+\infty)^l$, we say that $g$ is $(\alpha,\mathbf{L})$-Hölderian if one can take 
$w_j(z) = L_j z^{\alpha_j}$ for all $z \in [0,2]$ and $j = 1, \ldots, l$. It is said to be $\mathbf{L}$-Lipschitz if it is 
$(\alpha,\mathbf{L})$-Hölderian with $\alpha = (1, \ldots, 1)$. 

Note that our definition of a modulus of continuity implies that the $w_j$ are subadditive, 
a property which we shall often use in the sequel and that, given $g$, one can always 
choose for $w_j$ the least concave majorant of $\overline{w}_j$ where 

$$\overline{w}_j(z) = \sup_{x \in [-1,1]^l;\ x_j \leq 1 - z} |g(x) - g(x_1, \ldots, x_{j-1}, x_j + z, x_{j+1}, \ldots, x_l)|.$$

Then $w_j(z) \leq 2\overline{w}_j(z)$, according to Lemma 6.1 p. 43 of DeVore and Lorentz (1993). 
Let us now turn to our main approximation result, to be proved in Section 6.6. 

Proposition 3 Let $p \geq 1$, $g \in \mathcal{F}_{l,c}$, $f \in \mathcal{F}_{l,\infty}$ and $t, u \in \mathcal{T}^l$. If $w_g$ is a modulus of 
continuity for $g$, then 

$$\|g \circ u - f \circ t\|_p \leq d_\infty(g,f) + 2^{1/p}\sum_{j=1}^{l} w_{g,j}\left(\|u_j - t_j\|_p\right)$$

with the convention $2^{1/\infty} = 1$. 
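The inequality of Proposition 3 can be checked numerically on a toy example. The sketch below takes $p = 2$, $l = 2$, $E = [-1,1]^3$ and replaces $\mu$ by the empirical measure of a random sample of points, which is still a probability on $E$. The functions `g`, `f`, `u`, `t` are hypothetical choices: `g` is Lipschitz with constants $(1, 0.5)$, so that one can take $w_{g,j}(z) = L_j z$, and `f` is a bounded perturbation of `g` with $d_\infty(g,f) \leq 0.05$.

```python
import math
import random

L = (1.0, 0.5)                     # Lipschitz constants of g, coordinate by coordinate

def g(y):                          # |g(x) - g(y)| <= 1*|x1 - y1| + 0.5*|x2 - y2|
    return math.sin(y[0]) + 0.5 * y[1]

def f(y):                          # bounded approximation of g, sup|f - g| <= 0.05
    return g(y) + 0.05 * math.cos(3.0 * y[0])

def u(x):                          # mapping from [-1,1]^3 to [-1,1]^2
    return (x[0] * x[1], (x[1] + x[2]) / 2.0)

def t(x):                          # crude approximation of u, also valued in [-1,1]^2
    return (0.9 * x[0] * x[1], x[1] / 2.0)

def norm_p(values, p=2.0):
    """L_p norm with respect to the empirical measure of the sample."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)

random.seed(0)
sample = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2000)]

p = 2.0
lhs = norm_p([g(u(x)) - f(t(x)) for x in sample], p)
rhs = 0.05 + 2.0 ** (1.0 / p) * sum(
    L[j] * norm_p([u(x)[j] - t(x)[j] for x in sample], p) for j in range(2)
)
print(lhs, rhs)
```

In this Lipschitz case the left-hand side is bounded by the right-hand side for any sample, since the bound follows from the triangle inequality applied pointwise and then in $\mathbb{L}_p$.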



3.2.2 The main ideas underlying our construction 

Let us take $p = q = 2$ and $E = [-1,1]^k$ with $k \geq l \geq 1$. A consequence of 
Proposition 3 is the following. If one considers a finite dimensional linear space $F \subset \mathcal{F}_{l,\infty}$ 
for approximating $g$ and compact sets $T_j \subset \mathcal{T}$ for approximating the $u_j$, there exists 
$t \in T = \prod_{j=1}^{l} T_j$ such that the linear space $S_t = \{f \circ t \mid f \in F\}$ approximates the 
composite function $g \circ u$ with an error bound 

$$d(g \circ u, S_t) \leq d_\infty(g,F) + \sqrt{2}\sum_{j=1}^{l} w_{g,j}\left(d(u_j,T_j)\right). \qquad (3.4)$$

The case where the function $g$ is Lipschitz is of particular interest since, up to 
constants, the error bound we get is the sum of those for approximating separately $g$ by 
$F$ (with respect to the $\mathbb{L}_\infty$-distance) and the $u_j$ by $T_j$. In particular, if $s$ were exactly 
of the form $s = g \circ u$ for some known functions $u_j$, we could use a suitable linear 
space $F$ with dimension of order $D$ in $\mathbb{H}_{l,0}(D)$ to approximate $g$, and take $T_j = \{u_j\}$ 
for all $j$. In this case the linear space $S_u$ whose dimension is also of order $D$ would 
approximate $s = g \circ u$ with an error bounded by $D^{-1/l}$. Note that if the $u_j$ did 
belong to some Hölder space $\mathcal{H}^{\beta}([-1,1]^k)$ with $\beta \in (0,1]^k$, the overall regularity of 
the function $s = g \circ u$ could not be expected to be better than $\beta$-Hölderian, since 
this regularity is already achieved by taking $g(y_1, \ldots, y_l) = y_1$. In comparison, an 
approach based on the overall smoothness of $s$, which would completely ignore the 
fact that $s = g \circ u$ and the knowledge of the $u_j$, would lead to an approximation bound 
of order $D^{-\bar{\beta}/k}$. The former bound, $D^{-1/l}$, based on the structural assumption that 
$s = g \circ u$ therefore improves on the latter since $\bar{\beta} \leq 1$ and $k \geq l$. Of course, one could 
argue that the former approach uses the knowledge of the $u_j$, which is quite a strong 
assumption for statistical issues. Actually, a more reasonable approach would be to 
assume that $u$ is unknown but close to a parametric set $T$, in which case, it would 
be natural to replace the single model $S_u$ used for approximating $s$, by the family of 
models $\mathbb{S}_T(F) = \{S_t \mid t \in T\}$ and, ideally, let the usual model selection techniques 
select some best linear space among it. Unfortunately, results such as Theorem 1 do 
not apply to this case, since the family $\mathbb{S}_T(F)$ has the same cardinality as $T$ and 
is therefore typically not countable. The main idea of our approach is to take 
advantage of the fact that the $u_j$ take their values in $[-1,1]$ so that we can embed $T$ 
into a compact subset of $\mathcal{T}^l$. We may then introduce a suitably discretized version 
$T'$ of $T$ (more precisely, of its embedding) and replace the ideal collection $\mathbb{S}_T(F)$ by 
$\mathbb{S}_{T'}(F)$, for which similar approximation properties can be proved. The details of this 
discretization device will be given in the proofs of our main results. Finally, we shall 
let both $T$ and $F$ vary into some collections of models and use all the models of the 
various resulting collections $\mathbb{S}_{T'}(F)$ together in order to estimate $s$ at best. 

4 The basic theorems 

4.1 Model selection using classical approximation spaces 

If we assume that the unknown parameter $s$ to be estimated is equal or close to 
some composite function of the form $g \circ u$ with $u \in \mathcal{T}^l$ and $g \in \mathcal{F}_{l,c}$ and if we wish 
to estimate $g \circ u$ by model selection we need to have at our disposal families of models 
for approximating both $g$ and the components $u_j$, $1 \leq j \leq l$, of $u$. As already seen 
in Section 3.1, typical sets that are used for approximating elements of $\mathcal{F}_{l,c}$ or $\mathcal{T}$ 
are finite-dimensional linear spaces or subsets of them and we need a theorem which 
applies to such classical approximation sets for which it will be convenient to choose 
the following definition of dimension. 

Definition 2 Let $H$ be a linear space and $S \subset H$. The dimension $\mathcal{D}(S) \in \mathbb{N} \cup \{\infty\}$ 
of $S$ is 0 if $|S| = 1$ and is, otherwise, the dimension (in the usual sense) of the linear 
span of $S$. 

Our construction of estimators $\hat{s}$ of $g \circ u$ will be based on some set $\mathfrak{S}$ of the following 
form: 

$$\mathfrak{S} = (l, \mathbb{F}, \gamma, \mathbb{T}_1, \ldots, \mathbb{T}_l, \lambda_1, \ldots, \lambda_l), \quad l \in \mathbb{N}^*, \qquad (4.1)$$

where $\mathbb{F}, \mathbb{T}_1, \ldots, \mathbb{T}_l$ denote families of models and $\gamma$, $\lambda_j$ are measures on $\mathbb{F}$ and $\mathbb{T}_j$ 
respectively. In the sequel, we shall assume that $\mathfrak{S}$ satisfies the following requirements. 

Assumption 2 The set $\mathfrak{S}$ is such that 

i) the family $\mathbb{F}$ is a countable set and consists of finite-dimensional linear subspaces 
$F$ of $\mathcal{F}_{l,\infty}$ with respective dimensions $\mathcal{D}(F) \geq 1$, 

ii) for $j = 1, \ldots, l$, $\mathbb{T}_j$ is a countable set of subsets of $\mathbb{L}_q(E,\mu)$ with finite 
dimensions, 

iii) the measure $\gamma$ is a subprobability on $\mathbb{F}$, 

iv) for $j = 1, \ldots, l$, $\lambda_j$ is a subprobability on $\mathbb{T}_j$. 

Given $\mathfrak{S}$, one can design an estimator $\hat{s}$ with the following properties. 

Theorem 2 Let $\mathfrak{S}$ satisfy Assumption 2. One can build an estimator $\hat{s} = \hat{s}(X)$ 
satisfying, for all $u \in \mathcal{T}^l$ and $g \in \mathcal{F}_{l,c}$ with modulus of continuity $w_g$, 

$$C\,\mathbb{E}_s\left[d^2(s,\hat{s})\right] \leq \sum_{j=1}^{l} \inf_{T \in \mathbb{T}_j}\left\{l\,w_{g,j}^2\left(d(u_j,T)\right) + \tau\left[\Delta_{\lambda_j}(T) + i(g,j,T)\,\mathcal{D}(T)\right]\right\}$$
$$\qquad + d^2(s, g \circ u) + \inf_{F \in \mathbb{F}}\left\{d_\infty^2(g,F) + \tau\left[\Delta_\gamma(F) + \mathcal{D}(F)\right]\right\}, \qquad (4.2)$$

where $i(g,j,T) = 1$ if $\mathcal{D}(T) = 0$ and otherwise, 

$$i(g,j,T) = \inf\left\{i \in \mathbb{N}^* \mid l\,w_{g,j}^2(e^{-i}) \leq \tau\,i\,\mathcal{D}(T)\right\} < +\infty. \qquad (4.3)$$

Note that, since the risk bound (4.2) is valid for all $g \in \mathcal{F}_{l,c}$ and $u \in \mathcal{T}^l$, we can 
minimize the right-hand side of (4.2) with respect to $g$ and $u$ in order to optimize the 
bound. The proof of this theorem is postponed to Section 6.3. 

Of special interest is the case where $g$ is $\mathbf{L}$-Lipschitz. If one is mainly interested in 
the dependence of the risk bound with respect to $\tau$ as it tends to 0, one can check that 
$i(g,j,T) \leq \log \tau^{-1}$ for $\tau$ small enough (depending on $l$ and $\mathbf{L}$) so that (4.2) becomes, 
for such a small $\tau$, 

$$C'\,\mathbb{E}_s\left[d^2(s,\hat{s})\right] \leq \sum_{j=1}^{l} \inf_{T \in \mathbb{T}_j}\left\{d^2(u_j,T) + \tau\left(\Delta_{\lambda_j}(T) + \mathcal{D}(T)\log \tau^{-1}\right)\right\}$$
$$\qquad + d^2(s, g \circ u) + \inf_{F \in \mathbb{F}}\left\{d_\infty^2(g,F) + \tau\left[\mathcal{D}(F) + \Delta_\gamma(F)\right]\right\}.$$

If it were possible to apply Theorem 1 to the models $F$ with the distance $d_\infty$ and 
the models $T$ with the distance $d$ for each $j$ separately, we would get risk bounds 
of this form, apart from the value of $C'$ and the extra $\log \tau^{-1}$ factor. This means 
that, apart from this extra logarithmic factor, our procedure amounts to performing 
$l + 1$ separate model selection procedures, one with the collection $\mathbb{F}$ for estimating $g$ 
and the other ones with the collections $\mathbb{T}_j$ for the components $u_j$ and finally getting 
the sum of the $l + 1$ resulting risk bounds. The result is however slightly different 
when $g$ is no longer Lipschitz. When $g$ is $(\alpha,\mathbf{L})$-Hölderian then one can check that 
$i(g,j,T) \leq \mathcal{L}_{j,T}$ where $\mathcal{L}_{j,T} = 1$ if $\mathcal{D}(T) = 0$ and, if $\mathcal{D}(T) \geq 1$, 

$$\mathcal{L}_{j,T} = \left\lceil \alpha_j^{-1}\log\left(l L_j^2\,[\tau\,\mathcal{D}(T)]^{-1}\right)\right\rceil \vee 1 \qquad (4.4)$$
$$\qquad \leq C'(l,\alpha_j)\left[\log(\tau^{-1}) \vee \log\left(L_j^2/\mathcal{D}(T)\right) \vee 1\right]. \qquad (4.5)$$

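In this case, Theorem 2 leads to the following result. 

Corollary 1 Assume that the assumptions of Theorem 2 hold. For all $(\alpha,\mathbf{L})$-Hölderian 
functions $g$ with $\alpha \in (0,1]^l$ and $\mathbf{L} \in (\mathbb{R}_+^*)^l$, the estimator $\hat{s}$ of Theorem 2 
satisfies 

$$C\,\mathbb{E}_s\left[d^2(s,\hat{s})\right] \leq \sum_{j=1}^{l} \inf_{T \in \mathbb{T}_j}\left\{l L_j^2\,d^{2\alpha_j}(u_j,T) + \tau\left[\Delta_{\lambda_j}(T) + \mathcal{D}(T)\,\mathcal{L}_{j,T}\right]\right\}$$
$$\qquad + d^2(s, g \circ u) + \inf_{F \in \mathbb{F}}\left\{d_\infty^2(g,F) + \tau\left[\Delta_\gamma(F) + \mathcal{D}(F)\right]\right\}, \qquad (4.6)$$

where $\mathcal{L}_{j,T}$ is defined by (4.4) and bounded by (4.5). 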
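For a Hölderian modulus $w_{g,j}(z) = L_j z^{\alpha_j}$, the index $i(g,j,T)$ of (4.3) can be computed by direct search and compared with the explicit quantity $\lceil \alpha_j^{-1}\log(l L_j^2[\tau\mathcal{D}(T)]^{-1})\rceil \vee 1$. The sketch below does this; it assumes this reading of the explicit bound, and the function names and numerical values are hypothetical.

```python
import math

def i_index(l, L_j, alpha_j, tau, dim_T, max_iter=10_000):
    """Smallest positive integer i with l * w(e^{-i})^2 <= tau * i * D(T),
    for the Hoelderian modulus w(z) = L_j * z**alpha_j; equals 1 if D(T) = 0."""
    if dim_T == 0:
        return 1
    for i in range(1, max_iter + 1):
        if l * (L_j * math.exp(-alpha_j * i)) ** 2 <= tau * i * dim_T:
            return i
    raise ValueError("no admissible index found below max_iter")

def explicit_bound(l, L_j, alpha_j, tau, dim_T):
    """ceil(log(l * L_j^2 / (tau * D(T))) / alpha_j), never smaller than 1."""
    if dim_T == 0:
        return 1
    z = math.log(l * L_j ** 2 / (tau * dim_T)) / alpha_j
    return max(1, math.ceil(z))

# Hypothetical values: l = 2, L_j = 1, alpha_j = 1/2, tau = 1e-3, D(T) = 5.
print(i_index(2, 1.0, 0.5, 1e-3, 5), explicit_bound(2, 1.0, 0.5, 1e-3, 5))
```

In this example the search gives an index of 5 while the explicit quantity equals 12, so the latter is a convenient but not tight upper bound.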

4.2 Mixing collections corresponding to different values of $l$ 

If it is known that $s$ takes the special form $g \circ u$ for some unknown $g \in \mathcal{F}_{l,c}$ 
and $u \in \mathcal{T}^l$, or if $s$ is very close to some function of the form $g \circ u$, the previous 
approach is quite satisfactory. If we do not have such information, we may apply 
the previous construction with several values of $l$ simultaneously, approximating $s$ by 
different combinations $g_l \circ u_l$ with $u_l$ taking its values in $[-1,1]^l$, $g_l$ a function on 
$[-1,1]^l$ and $l$ varying among some subset $I$ of $\mathbb{N}^*$. To each value of $l$, we associate, 
as before, $l + 1$ collections of models and the corresponding subprobabilities, each 
$l$ then leading to an estimator $\hat s_l$ the risk of which is bounded by $\mathcal{R}(\hat s_l, g_l, u_l)$ given 
by the right-hand side of (4.2). The model selection approach allows us to use all 
the previous collections of models for all values of $l$ simultaneously to build a new 
estimator the risk of which is approximately as good as the risk of the best of the $\hat s_l$. 
More generally, let us assume that we have at hand a countable family $\{\mathfrak{S}_\iota, \iota \in I\}$ 
of sets of the form (4.1) satisfying Assumption 2 for some $l = l(\iota) \ge 1$. To each 
such set, Theorem 2 associates an estimator $\hat s_\iota$ with a risk bounded by 

\[ C\,\mathbb{E}_s\left[d^2(s, \hat s_\iota)\right] \le \inf_{(g,u)} \mathcal{R}(\hat s_\iota, g, u), \]

where $\mathcal{R}(\hat s_\iota, g, u)$ denotes the right-hand side of (4.2) when $\mathfrak{S} = \mathfrak{S}_\iota$ and the infimum 
runs among all pairs $(g,u)$ with $g \in \mathcal{F}_{l(\iota),c}$ and $u \in \mathcal{T}^{l(\iota)}$. We can then prove (in 
Section 6.3 below) the following result. 

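Theorem 3 Let $I$ be a countable set and $\nu$ a subprobability on $I$. For each $\iota \in I$ 
we are given a set $\mathfrak{S}_\iota$ of the form (4.1) that satisfies Assumption 2 with $l = l(\iota)$ 
and a corresponding estimator $\hat s_\iota$ provided by Theorem 2. One can then design a new 
estimator $\hat s = \hat s(X)$ satisfying 

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \inf_{\iota \in I}\ \inf_{(g,u)} \left\{ \mathcal{R}(\hat s_\iota, g, u) + \tau\Delta_\nu(\iota) \right\}, \]

where $\mathcal{R}(\hat s_\iota, g, u)$ denotes the right-hand side of (4.2) when $\mathfrak{S} = \mathfrak{S}_\iota$ and the second 
infimum runs among all pairs $(g,u)$ with $g \in \mathcal{F}_{l(\iota),c}$ and $u \in \mathcal{T}^{l(\iota)}$. 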
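The price $\tau\Delta_\nu(\iota) = -\tau\log\nu(\iota)$ paid in Theorem 3 for mixing the families can be made concrete with a small Python sketch. It checks numerically that two weight choices used in Section 5, $\nu(l) = e^{-l}$ on $\mathbb{N}^*$ and $\nu(l,r) = e^{-(l+r+1)}$ on $\mathbb{N}^* \times \mathbb{N}$, have total mass smaller than 1, and it evaluates the right-hand side of the theorem for hypothetical values of the individual bounds. The numbers stored in `bounds` are made up for illustration and are not taken from the paper.

```python
import math

def total_mass_exp(n_terms=200):
    """Partial sum of nu(l) = exp(-l) over l = 1..n_terms (limit: 1/(e-1))."""
    return sum(math.exp(-l) for l in range(1, n_terms + 1))

def total_mass_exp2(n_terms=200):
    """Partial sum of nu(l, r) = exp(-(l+r+1)) over l >= 1 and r >= 0."""
    return sum(math.exp(-(l + r + 1))
               for l in range(1, n_terms + 1)
               for r in range(n_terms))

def mixed_bound(bounds, tau, nu):
    """inf over iota of { R_iota + tau * Delta_nu(iota) }, Delta_nu = -log nu."""
    return min(R - tau * math.log(nu(i)) for i, R in bounds.items())

bounds = {1: 0.30, 2: 0.12, 3: 0.15}   # hypothetical values of R(s_l, g, u)
tau = 0.01
best = mixed_bound(bounds, tau, lambda l: math.exp(-l))
```

Here the selected value is the bound for $l = 2$ plus the penalty $2\tau$: the mixing costs an additive term of order $\tau$ for each candidate, which is negligible as soon as the individual bounds are larger than $\tau$.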

5 Applications 

The aim of this section is to provide various applications of Theorem 2 and its corollaries. 
For approximating functions on $[-1,1]^k$, we shall repeatedly use the families 
$\mathbb{H}_{k,r}$ and $\mathbb{B}_{k,r}$ introduced in Section 3.1. Since for all $r \ge 0$, $\mathbb{H}_{k,r}$ satisfies Assumption 1 
for some constant $c(\mathbb{H}_{k,r})$, the measure $\gamma$ on $\mathbb{H}_{k,r}$ defined by 

\[ \Delta_\gamma(S) = \left(c(\mathbb{H}_{k,r}) + 1\right)(D + 1) \quad \text{for all } S \in \mathbb{H}_{k,r}(D) \setminus \bigcup_{1 \le D' < D} \mathbb{H}_{k,r}(D') \tag{5.1} \]

is a subprobability since 

^ g-A,(5) < ^ g-D |e,^,,(D)| e-^(=*-)(^+i) < e-^ < 1. 

5eHi._^ D>1 D>1 

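We shall similarly consider the subprobability $\lambda$ defined on $\mathbb{B}_{k,r}$ by 

\[ \Delta_\lambda(S) = \left(c(\mathbb{B}_{k,r}) + 1\right)(D + 1) \quad \text{for all } S \in \mathbb{B}_{k,r}(D) \setminus \bigcup_{1 \le D' < D} \mathbb{B}_{k,r}(D'). \tag{5.2} \]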

Finally, for g G 1^«([-1,1]') = ^^,^([-1,1]') with a G (M^)', we set \\g\\a,^ = 
1 5 1 Q, 00, 00 + inf L' where the infimum runs among all numbers L' for which Wgj{z) < 
I'z'^jM for ah z G [0, 2] and j = 1, . . . , /. 






5.1 Estimation of smooth functions on $[-1,1]^k$ 

In this section, our aim is to establish risk bounds for our estimator $\hat s$ when $s = g \circ u$ 
for some smooth functions $g$ and $u$. We shall discuss the improvement, in terms of 
rates of convergence as $\tau$ tends to 0, when assuming such a structural hypothesis, as 
compared to a pure smoothness assumption on $s$. Throughout this section, we take 
$q = 2$, $E = [-1,1]^k$ and $d$ as the $L_2$-distance on $L_2(E, 2^{-k}dx)$. 



5.1.1 Convergence rates using composite functions 

Let us consider here the set Sk.i{c(, P,p, L, R) gathering the composite functions 

gou with g £ T-L^d—l, 1]') satisfying 11(7110; ^ ^ "^j ^ ^p/,Pj with semi- norms 
\uj\a < Rj for all ?' = 1, . . . , /. The following result holds. 

Theorem 4 There exists an estimator 's such that, for all I > 1, a,R £ (1^+)^ 
L > 0, /3i,...,/3; G (RX)^ and p G (0,+oo]' with > k (^pj^ - 2'^^ VO for 

sup C'Es[d^{s,s)] 

2?J(qjA1) 



i=i 

where $\mathcal{L} = \log(\tau^{-1}) \vee \log(L^2) \vee 1$ and $C'$ depends on $k$, $l$, $\alpha$, $\beta$ and $p$. 

Let us recall that we need not assume that $s$ is exactly of the form $g \circ u$ but rather, as 
we did before, that $s$ can be approximated by a function $\bar s = g \circ u \in S_{k,l}(\alpha, \beta, p, L, R)$. 
In such a case we simply get an additional bias term of the form $d^2(s, \bar s)$ in our risk 
bounds. 

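Proof: Let us fix some value of $l \ge 1$, take $s = g \circ u \in S_{k,l}(\alpha, \beta, p, L, R)$ and 
define 

\[ r(\alpha,\beta) = 1 + \left\lfloor \max_{i} \alpha_i \vee \max_{i,j} \beta_{i,j} \right\rfloor. \]

The regularity properties of $g$ and the $u_j$ together with Propositions 1 and 2 imply 
that for all $D \ge 1$, there exist $F \in \mathbb{H}_{l,r}(D)$ and sets $T_j \in \mathbb{B}_{k,r}(D)$ for $j = 1, \ldots, l$ such 
that 

\[ \mathcal{D}(F) \le C'_1(l,\alpha,\beta)\,D; \qquad d_\infty(g,F) \le C'_2(l,\alpha,\beta)\,L D^{-\bar\alpha/l} \]
and, for $1 \le j \le l$, 

\[ \mathcal{D}(T_j) \le C'_3(k,\alpha,\beta_j,p_j)\,D; \qquad d(u_j,T_j) \le C'_4(k,\alpha,\beta_j,p_j)\,R_j D^{-\bar\beta_j/k}. \]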

Since the collections $\mathbb{H}_{l,r}$ and $\mathbb{B}_{k,r}$ satisfy Assumption 1 and $w_{g,j}(z) \le L z^{\alpha_j \wedge 1}$ for all 
$j$ and $z \in [0,2]$, we may apply Corollary 1, 
the subprobabilities $\gamma_{l,r}$ and $\lambda_{k,r}$ being given by (5.1) and (5.2) respectively. Besides, 
it follows from (4.5) that $\mathcal{L}_{j,T} \le C'(l,\alpha)\mathcal{L}$ for all $j$, so that (4.6) implies that the risk 
of the resulting estimator is bounded from above by 



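\[ C' \left\{ \sum_{j=1}^{l} \inf_{D \ge 1} \left[ L^2 R_j^{2(\alpha_j \wedge 1)} D^{-2\bar\beta_j(\alpha_j \wedge 1)/k} + D\tau\mathcal{L} \right] + \inf_{D \ge 1} \left[ L^2 D^{-2\bar\alpha/l} + D\tau \right] \right\} \]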

for some constant $C'$ depending on $l, k, \alpha, \beta_1, \ldots, \beta_l$. We obtain the result by optimizing 
each term of the sum with respect to $D$, by means of Lemma 1, and by using 
Theorem 3 with $\nu$ defined for $\iota = (l,r) \in \mathbb{N}^* \times \mathbb{N}$ by $\nu(l,r) = e^{-(l+r+1)}$, for which 
$\Delta_\nu(l,r)\tau \le (l + r + 1)\mathcal{R}(\hat s_{l,r}, g, u)$ for all $l, r$. □ 



5.1.2 Structural assumption versus smoothness assumption 

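In view of discussing the interest of the risk bounds provided by Theorem 4, let us 
focus here, for simplicity, on the case where $g \in \mathcal{H}^\alpha([-1,1])$ with $\alpha > 0$ (hence $l = 1$) 
and $u$ is a function from $E = [-1,1]^k$ to $[-1,1]$ that belongs to $\mathcal{H}^\beta([-1,1]^k)$ with 
$\beta \in (\mathbb{R}_+^*)^k$. The following proposition is to be proved in Section 6.7. 

Proposition 4 Let $\phi$ be the function defined on $(\mathbb{R}_+^*)^2$ by 

\[ \phi(x,y) = \begin{cases} xy & \text{if } x \vee y \le 1; \\ x \wedge y & \text{otherwise.} \end{cases} \]

For all $k \ge 1$, $\alpha > 0$, $\beta \in (\mathbb{R}_+^*)^k$, $g \in \mathcal{H}^\alpha([-1,1])$ and $u \in \mathcal{H}^\beta([-1,1]^k)$, 

\[ g \circ u \in \mathcal{H}^\theta([-1,1]^k) \quad \text{with } \theta_i = \phi(\beta_i, \alpha) \text{ for } 1 \le i \le k. \tag{5.3} \]

Moreover, $\theta$ is the largest possible value for which (5.3) holds for all $g \in \mathcal{H}^\alpha([-1,1])$ 
and $u \in \mathcal{H}^\beta([-1,1]^k)$ since, whatever $\theta' \in (\mathbb{R}_+^*)^k$ such that $\theta'_i > \theta_i$ for some $i \in 
\{1, \ldots, k\}$, there exist some $g \in \mathcal{H}^\alpha([-1,1])$ and $u \in \mathcal{H}^\beta([-1,1]^k)$ such that $g \circ u \notin 
\mathcal{H}^{\theta'}([-1,1]^k)$. 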

Using the information that $s$ belongs to $\mathcal{H}^\theta([-1,1]^k)$ with $\theta$ given by (5.3) and that 
we cannot assume that $s$ belongs to some smoother class (although this may happen 
in special cases) since $\theta$ is minimal, but ignoring the fact that $s = g \circ u$, we can 
estimate $s$ at rate $\tau^{2\bar\theta/(2\bar\theta+k)}$ (as $\tau$ tends to 0) while, on the other hand, by using 
Theorem 4 and the structural information that $s = g \circ u$, we can achieve the rate 

\[ \tau^{2\alpha/(2\alpha+1)} + \left[\tau\log\tau^{-1}\right]^{2\bar\beta(\alpha \wedge 1)/(2\bar\beta(\alpha \wedge 1)+k)}. \]

Let us now compare these two rates. First note that it follows from (5.3) that $\theta_i \le \alpha$ 
for all $i$, hence $\bar\theta \le \alpha$ and, since $k \ge 1$, $2\alpha/(2\alpha+1) \ge 2\bar\theta/(2\bar\theta+k)$. Therefore the 
term $\tau^{2\alpha/(2\alpha+1)}$ always improves over $\tau^{2\bar\theta/(2\bar\theta+k)}$ when $\tau$ is small and, to compare the 
two rates, it is enough to compare $\bar\theta$ with $\bar\beta(\alpha \wedge 1)$. To do so, we use the following 
lemma (to be proved in Section 6.8). 

Lemma 2 For all $\alpha > 0$ and $\beta \in (\mathbb{R}_+^*)^k$, the smoothness index 

\[ \theta = (\phi(\alpha,\beta_1), \ldots, \phi(\alpha,\beta_k)) \]

satisfies $\bar\theta \le \bar\beta(\alpha \wedge 1)$ and equality holds if and only if $\sup_{1 \le j \le k} \beta_j \le \alpha \vee 1$. 

When $\sup_{1 \le j \le k} \beta_j \le \alpha \vee 1$, our special strategy does not bring any improvement as 
compared to the standard one; it even slightly deteriorates the risk bound because of 
the extra $\log\tau^{-1}$ factor. On the other hand, if $\sup_{1 \le j \le k} \beta_j > \alpha \vee 1$, our new strategy 
improves over the classical one and this improvement can be substantial if $\bar\beta$ is much 
larger than $\alpha \vee 1$. If, for instance, $\alpha = 1$ and $\bar\beta = k = \beta_j$ for all $j$, we get a bound of 
order $[\tau\log(\tau^{-1})]^{2/3}$ which, apart from the extra $\log\tau^{-1}$ factor, corresponds to the 
minimax rate of estimation of a Lipschitz function on $[-1,1]$, instead of the risk bound 
$\tau^{2/(2+k)}$ that we would get if we estimated $s$ as a Lipschitz function on $[-1,1]^k$. When 
our strategy does not improve over the classical one, i.e. when $\sup_{1 \le j \le k} \beta_j \le \alpha \vee 1$, the 
additional loss due to the extra logarithmic factor in our risk bound can be avoided by 
mixing the models used for the classical strategy with the models used for designing 
our estimator, following the recipe of Section 4.2. 
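The comparison between the two strategies can be explored numerically. The Python sketch below implements the function $\phi$ of Proposition 4 and returns the exponents of $\tau$ in the classical rate $\tau^{2\bar\theta/(2\bar\theta+k)}$ and in the structural rate $[\tau\log\tau^{-1}]^{2\bar\beta(\alpha\wedge 1)/(2\bar\beta(\alpha\wedge 1)+k)}$. It assumes that the bar notation stands for the harmonic mean of the coordinates, and it ignores both the logarithmic factor and the term $\tau^{2\alpha/(2\alpha+1)}$; the function names are illustrative.

```python
def phi(x, y):
    """phi(x, y) = x*y if max(x, y) <= 1, and min(x, y) otherwise (Proposition 4)."""
    return x * y if max(x, y) <= 1 else min(x, y)

def harmonic_mean(values):
    """Harmonic mean; assumed here to be the meaning of the bar notation."""
    return len(values) / sum(1.0 / v for v in values)

def rate_exponents(alpha, beta):
    """Exponents of tau in the two rates compared in the text.

    Returns (2*theta_bar/(2*theta_bar + k), 2*b/(2*b + k)) with
    theta_i = phi(beta_i, alpha) and b = beta_bar * min(alpha, 1).
    """
    k = len(beta)
    theta_bar = harmonic_mean([phi(b, alpha) for b in beta])
    b = harmonic_mean(beta) * min(alpha, 1.0)
    return 2 * theta_bar / (2 * theta_bar + k), 2 * b / (2 * b + k)

# Example from the text: alpha = 1 and beta_j = k for all j, here with k = 4.
classical, structural = rate_exponents(1.0, [4.0] * 4)
```

In this example the classical exponent is $2/(2+k) = 1/3$ while the structural one is $2/3$. Taking for instance $\alpha = 0.5$ and $\beta = (0.5, 0.8)$, for which $\sup_j \beta_j \le \alpha \vee 1$, gives two equal exponents, in agreement with the equality case of Lemma 2.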



5.2 Generalized additive models 

In this section, we assume that $E = [-1,1]^k$, $\mu$ is the Lebesgue probability on $E$ and 
$q = 2$. A special structure that has often been considered in regression corresponds 
to functions $s = g \circ u$ with 

\[ u(x_1, \ldots, x_k) = u_1(x_1) + \ldots + u_k(x_k); \qquad s(x) = g\left(u_1(x_1) + \ldots + u_k(x_k)\right), \tag{5.4} \]

where the $u_j$ take their values in $[-1/k, 1/k]$ for all $j = 1, \ldots, k$. Such a model 
has been considered in Horowitz and Mammen (2007) and while their approach is 
non-adaptive, ours, based on Theorem 2 and a suitable choice of the collections of 
models, allows us to derive a fully adaptive estimator with respect to the regularities 
of $g$ and the $u_j$. More precisely, for $r \in \mathbb{N}$, let $\mathbb{T}_r$ be the collection of all models 
of the form $T = T_1 + \ldots + T_k$ where for $j = 1, \ldots, k$, $T_j$ is the set of functions of 
the form $x \mapsto t_j(x_j)$ with $x \in E$ and $t_j$ in $\mathbb{B}_{1,r}$. Using $\lambda_r = \lambda$ as defined by (5.2), 
we endow $\mathbb{T}_r$ with the subprobability $\lambda_r^{(k)}$ defined for $T \in \mathbb{T}_r$ by the infimum of the 
quantities $\prod_{i=1}^{k} \lambda_r(T_i)$ when $(T_1, \ldots, T_k)$ runs among all the $k$-tuples of $\mathbb{B}_{1,r}^k$ satisfying 

T = Ti + ... + Tk. Finally, for a,L>0, (3,R£ (R\)^ and p = (pi, . . . ,pk) G (M* 
:Add/ 



let S^'^'^{a,p,p,L,R) be the set of functions of the form ([531) with g G ^"([-1, 1]) 

^nrr lUII ^ T o ti ^ „,. d R'^J /" [_ 1 iTr^fln L 



satisfying IblU^oo ^ ^ ^'^^ % ^ ^^pj.Pj ([-!' 1]) with \uj\^ < Rjk ^ for ah j 



l,...,k. Using the sets 6r = (1, Eti^^, 7^, T^, aI'^^) with r G N we can build an 
estimator with the following property. 

Theorem 5 There exists an estimator s" which satisfies for all a,L > 0, p,R € 
(IR^)'= and (3 G with fij > {l/pj - 1/2)^ for allj = l,..., k, 

sup C'Ks[d^{s,s)] 

seS^'i'^(a,P,p,L,R) 
where $\mathcal{L} = \log(\tau^{-1}) \vee \log(L^2) \vee 1$ and $C'$ is a constant that depends on $\alpha$, $\beta$, $p$ and 
$k$ only. 

If one is mainly interested in the rate of convergence as $\tau$ tends to 0, the bound we get 
is of order $\max\{\tau^{2\alpha/(2\alpha+1)}, [\tau\log(\tau^{-1})]^{2(\alpha \wedge 1)\beta/(2(\alpha \wedge 1)\beta+1)}\}$ where $\beta = \min\{\beta_1, \ldots, \beta_k\}$. 
In particular, if $\alpha \ge 1$, this rate is the same as that we would obtain for estimating 
a function on $[-1,1]$ with the smallest regularity among $\alpha, \beta_1, \ldots, \beta_k$. 

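Proof: Let us consider some $s = g \circ u$ in the class of Theorem 5 and $r = 1 + \lfloor \alpha \vee \beta_1 \vee 
\ldots \vee \beta_k \rfloor$. For all $D, D_1, \ldots, D_k \ge 1$, there exist $F \in \mathbb{H}_{1,r}(D)$ and $T_j \in \mathbb{B}_{1,r}(D_j)$ for 
all $j = 1, \ldots, k$ such that 

\[ \mathcal{D}(F) \le C'_1(r)\,D; \qquad d_\infty(g,F) \le C'_2(r)\,L D^{-\alpha}; \]

and, for $1 \le j \le k$, 

\[ \mathcal{D}(T_j) \le C'_3(k,r,p)\,D_j; \qquad d(u_j,T_j) \le C'_4(k,r,p)\,R_j k^{-1} D_j^{-\beta_j}. \]

If $T = T_1 + \ldots + T_k$, then $\mathcal{D}(T) \le \sum_{j=1}^{k} \mathcal{D}(T_j)$ and $\Delta_{\lambda_r^{(k)}}(T) \le \sum_{j=1}^{k} \Delta_{\lambda_r}(T_j) \le (c(\mathbb{B}_{1,r}) + 
1)\sum_{j=1}^{k}(D_j + 1)$. Moreover, $d(u,T) \le \sum_{j=1}^{k} d(u_j,T_j) \le C'_4 k^{-1}\sum_{j=1}^{k} R_j D_j^{-\beta_j}$, hence, 
by the Cauchy-Schwarz inequality, $d^2(u,T) \le (C'_4)^2 k^{-1}\sum_{j=1}^{k} R_j^2 D_j^{-2\beta_j}$ and finally 

\[ d^{2(\alpha \wedge 1)}(u,T) \le (C'_4)^{2(\alpha \wedge 1)} \sum_{j=1}^{k} \left(R_j k^{-1/2}\right)^{2(\alpha \wedge 1)} D_j^{-2(\alpha \wedge 1)\beta_j}. \]

For all $T$, $\mathcal{L}_{1,T} \le C'(\alpha)\mathcal{L}$ and since $w_g(z) \le L z^{\alpha \wedge 1}$ for all $z \in [0,2]$, we may apply 
Corollary 1 with $l = 1$ and get that the risk of the resulting estimator satisfies 

\[ C'\,\mathcal{R}(\hat s_r, g, u) = \sum_{j=1}^{k} \inf_{D \ge 1} \left[ L^2 \left(R_j k^{-1/2}\right)^{2(\alpha \wedge 1)} D^{-2(\alpha \wedge 1)\beta_j} + D\tau\mathcal{L} \right] + \inf_{D \ge 1} \left[ L^2 D^{-2\alpha} + D\tau \right]. \]

We conclude by arguing as in the proof of Theorem 4. □ 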

5.3 Multiple index models and artificial neural networks 

In this section, we assume that $E = [-1,1]^k$, $q = 2$ and $d$ is the distance in $L_2(E,\mu)$ 
where $\mu$ is the Lebesgue probability on $E$. We denote by $|\cdot|_1$ and $|\cdot|_\infty$ respectively 
the $\ell_1$- and $\ell_\infty$-norms in $\mathbb{R}^k$ and by $\mathcal{C}_k$ the unit ball for the $\ell_1$-norm. As we noticed 
earlier, when $s$ is an arbitrary function on $E$ and $k$ is large, there is no hope to get a 
nice estimator for $s$ without some additional assumptions. A very simple one is that 
$s(x)$ can be written as $g(\langle\theta, x\rangle)$ for some $\theta \in \mathcal{C}_k$, which corresponds to the so-called 
single index model. More generally, we may assume that $s$ can be well approximated 
by some function $\bar s$ of the form 

\[ \bar s(x) = g\left(\langle\theta_1, x\rangle, \ldots, \langle\theta_l, x\rangle\right) \quad \text{for all } x \in E, \]

where $\theta_1, \ldots, \theta_l$ are $l$ elements of $\mathcal{C}_k$ and $g$ maps $[-1,1]^l$ to $\mathbb{R}$, $l$ being possibly unknown 
and larger than $k$. When $\bar s = g \circ u$ is of this form, the coordinate functions $u_j(\cdot) = 
\langle\theta_j, \cdot\rangle$, for $1 \le j \le l$, belong to the set $\mathcal{T}_0 \subset \mathcal{T}$ of functions on $E$ of the form $x \mapsto \langle\theta, x\rangle$ 
with $\theta \in \mathcal{C}_k$, which is a subset of a $k$-dimensional linear subspace of $L_2(E,\mu)$, hence 
$\mathcal{D}(\mathcal{T}_0) \le k$. A slight generalization of this situation leads to the following result. 

Theorem 6 For $j \ge 1$, let $T_j$ be a subset of $\mathcal{T}$ with finite dimension $k_j$ and for 
I C N* and I £ I, let ¥i be a collection of models satisfying Assumptionsl^(i and Hi) 
for some subprobability 7^ . There exists an estimator s which satisfies 

CEs [dHs,s)] 



< 



inf inf 

lei geTLc,ueTi 



d^{s,gou) + A{g,¥i,ji) + TS^^i^ia, j^Tj) 



(5.5) 



where T, 



TiX ...xTi, i{g,j,Tj 



is 



defined by and 



inf {dlig,F)+T[ViF) + A^,iF)]}. 

In particular, for all I £ I and {a,'L)-Hdlderian functions g with a G (0,1]' and 
L G (Ml)' 



CEs[d^ {s,s))] 

I 

d^{s, g o u) + A{g,¥i,-fi) + rY^kj 



< 



inf 

ueTi 



a. 



■\og{lL]{k,T)-^)\l I 



. (5.6) 



Let us comment on this result, fixing some value $l \in I$. The term $d(s, g \circ u)$ corresponds 
to the approximation of $s$ by functions of the form $g(u_1(\cdot), \ldots, u_l(\cdot))$ with $g$ in $\mathcal{F}_{l,c}$ 
and $u_1, \ldots, u_l$ in $T_1, \ldots, T_l$ respectively. As to the quantity $A(g, \mathbb{F}_l, \gamma_l)$, it corresponds 
to the estimation bound for estimating the function $g$ alone if $s$ were really of the 
previous form. Finally, the quantity $\tau\sum_{j=1}^{l} k_j\,i(g,j,T_j)$ corresponds to the sum of the 
statistical errors for estimating the $u_j$. If for all $j$, the dimensions of the $T_j$ remain 
bounded by some integer $k$ independent of $\tau$, which amounts to making a parametric 
assumption on the $u_j$, and if $g$ is smooth enough, the quantity $\tau\sum_{j=1}^{l} k_j\,i(g,j,T_j)$ is 
then of order $-\tau\log\tau$ for small values of $\tau$ as seen in (5.6). 

Proof of Theorem 6: For all $j$, we choose $\lambda_j$ to be the Dirac mass at $T_j$ so that 
$\Delta_{\lambda_j}(T_j) = 0 = d(u_j,T_j)$. The result follows by applying Theorem 2 (for a fixed value 
of $l \in I$) and then Theorem 3 with $\nu$ defined by $\nu(l) = e^{-l}$ for all $l \in I$. □ 



5.3.1 The multiple index model 

As already mentioned, the multiple index model amounts to assuming that $s$ is of the 
form 

\[ s(x) = g\left(\langle\theta_1, x\rangle, \ldots, \langle\theta_l, x\rangle\right) \quad \text{whatever } x \in E, \]

for some known $l \ge 1$ and $k_j = k$ for all $j$. For $L > 0$ and $\alpha \in (\mathbb{R}_+^*)^l$, let us 
denote by $S_l^\alpha(L)$ the set of functions $s$ of this form with $g \in \mathcal{H}^\alpha([-1,1]^l)$ satisfying 
$\|g\|_{\alpha,\infty} \le L$. Applying Theorem 6 to this special case, we obtain the following result. 

Corollary 2 Let $I \subset \mathbb{N}^*$. There exists an estimator $\hat s$ such that for all $l \in I$, $\alpha \in 
(\mathbb{R}_+^*)^l$ and $L > 0$, 

\[ \sup_{s \in S_l^\alpha(L)} C'\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le L^{2l/(2\bar\alpha+l)}\tau^{2\bar\alpha/(2\bar\alpha+l)} + k\tau\mathcal{L}, \]

where $\mathcal{L} = \log(\tau^{-1}) \vee \log(L^2 k^{-1}) \vee 1$ and $C'$ is a constant depending on $l$ and $\alpha$ only. 

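Proof: Fix $s = g \circ u \in S_l^\alpha(L)$ and apply Theorem 6 with $T_j = \mathcal{T}_0$ for all $j \ge 1$, 
$I = \{l\}$, $\mathbb{F}_l = \mathbb{H}_{l,r}$ and $\gamma_l$ defined by (5.1) with $k = l$ and $r = \lfloor \alpha_1 \vee \ldots \vee \alpha_l \rfloor$. Arguing 
as in the proof of Theorem 4, we obtain an estimator $\hat s_r$ the risk of which satisfies 

\[ C'\,\mathbb{E}_s\left[d^2(s,\hat s_r)\right] \le \inf_{D \ge 1} \left( L^2 D^{-2\bar\alpha/l} + D\tau \right) + \tau k\mathcal{L} \le C''\left[ L^{2l/(2\bar\alpha+l)}\tau^{2\bar\alpha/(2\bar\alpha+l)} + k\tau\mathcal{L} \right] \]

for constants $C'$ and $C''$ depending on $l$ and $\alpha$ only. Finally, we conclude as in the 
proof of Theorem 4. □ 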
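The optimization over $D$ used in this proof can be checked with a short Python sketch. It minimizes $D \mapsto L^2 D^{-2\bar\alpha/l} + D\tau$ over the integers $D \ge 1$, using the convexity of this function, and compares the result with the rate $L^{2l/(2\bar\alpha+l)}\tau^{2\bar\alpha/(2\bar\alpha+l)}$. The constant 2 appearing in the comparison comes from an elementary computation (take $D$ equal to the smallest integer above $(L^2/\tau)^{l/(2\bar\alpha+l)}$), not from the lemma used in the paper, and the function names are illustrative.

```python
import math

def best_tradeoff(L, a_bar, l, tau):
    """Minimum over integers D >= 1 of L^2 * D^(-2*a_bar/l) + D * tau."""
    c = 2.0 * a_bar / l

    def f(D):
        return L * L * D ** (-c) + D * tau

    # f is convex on (0, infinity); its real-valued minimiser is d0, so the
    # integer minimiser is one of the two integers surrounding d0 (or D = 1).
    d0 = (c * L * L / tau) ** (1.0 / (c + 1.0))
    candidates = {max(1, math.floor(d0)), max(1, math.ceil(d0))}
    return min(f(D) for D in candidates)

def rate(L, a_bar, l, tau):
    """L^{2l/(2 a_bar + l)} * tau^{2 a_bar/(2 a_bar + l)}."""
    return L ** (2 * l / (2 * a_bar + l)) * tau ** (2 * a_bar / (2 * a_bar + l))

value = best_tradeoff(L=2.0, a_bar=1.5, l=3, tau=1e-4)
target = rate(L=2.0, a_bar=1.5, l=3, tau=1e-4)
```

The additive term $\tau$ in the comparison accounts for the case $L^2 < \tau$, where the infimum is attained at $D = 1$.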



5.3.2 Case of an additive function g 

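In the multiple index model, when the value of $l$ is allowed to become large (typically 
not smaller than $k$) it is often assumed that $g$ is additive, i.e. of the form 

\[ g(y_1, \ldots, y_l) = g_1(y_1) + \ldots + g_l(y_l) \quad \text{for all } y \in [-1,1]^l, \tag{5.7} \]

where the $g_j$ are smooth functions from $[-1,1]$ to $\mathbb{R}$. Hereafter, we shall denote by 
$\mathcal{F}^{Add}_{l,c}$ the set of such additive functions $g$. The functions $\bar s = g \circ u$ with $g \in \mathcal{F}^{Add}_{l,c}$ and 
$u \in \mathcal{T}_0^l$ hence take the form 

\[ \bar s(x) = \sum_{j=1}^{l} g_j\left(\langle\theta_j, x\rangle\right) \quad \text{for all } x \in E. \tag{5.8} \]

For each $j = 1, \ldots, l$, let $\mathbb{F}_j$ be a countable family of finite dimensional linear subspaces 
of $\mathcal{F}_{1,\infty}$ designed to approximate $g_j$ and $\gamma_j$ some subprobability measure on $\mathbb{F}_j$. Given 
$(F_1, \ldots, F_l) \in \prod_{j=1}^{l} \mathbb{F}_j$, we define the subspace $F$ of $\mathcal{F}_{l,\infty}$ as 

\[ F = \left\{ f(y_1, \ldots, y_l) = f_1(y_1) + \ldots + f_l(y_l) \ \middle|\ f_j \in F_j \text{ for } 1 \le j \le l \right\} \tag{5.9} \]

and denote by $\mathbb{F}$ the set of all such $F$ when $(F_1, \ldots, F_l)$ varies among $\prod_{j=1}^{l} \mathbb{F}_j$. Then, 
we define a subprobability measure $\gamma$ on $\mathbb{F}$ by setting 

\[ \gamma(F) = \prod_{j=1}^{l} \gamma_j(F_j) \quad \text{or} \quad \Delta_\gamma(F) = \sum_{j=1}^{l} \Delta_{\gamma_j}(F_j), \]

when $F$ is given by (5.9). For such an $F$, $d_\infty(g,F) \le \sum_{j=1}^{l} d_\infty(g_j,F_j)$, hence 
$d_\infty^2(g,F) \le l\sum_{j=1}^{l} d_\infty^2(g_j,F_j)$ and $\mathcal{D}(F) \le \sum_{j=1}^{l} \mathcal{D}(F_j)$. We deduce from Theorem 6 
the following result. 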

Corollary 3 Let / C N* and for j > 1, let Fj he a collection of finite dimensional 
linear subspaces of Fi^oo satisfying Assumption\^i) and-iii) for some subprobability 
7j . There exists an estimator s such that 

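\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \inf_{l \in I}\ \inf_{g \in \mathcal{F}^{Add}_{l,c},\, u \in \mathcal{T}_0^l} \left\{ d^2(s, g \circ u) + \sum_{j=1}^{l} \left( R_j(g, \mathbb{F}_j, \gamma_j) + \tau k\,i(g,j,\mathcal{T}_0) \right) \right\}, \]

where 

\[ R_j(g, \mathbb{F}_j, \gamma_j) = \inf_{F_j \in \mathbb{F}_j} \left\{ d_\infty^2(g_j, F_j) + \tau\left[\mathcal{D}(F_j) + \Delta_{\gamma_j}(F_j)\right] \right\} \quad \text{for } 1 \le j \le l. \]

Moreover, if $s$ is of the form (5.8) for some $l \in I$ and functions $g_j \in \mathcal{H}^{\alpha_j}([-1,1])$ 
satisfying $\|g_j\|_{\alpha_j,\infty} \le L_j$ for $\alpha_j, L_j > 0$ and all $j = 1, \ldots, l$, one can choose the $\mathbb{F}_j$ 
and $\gamma_j$ in such a way that 

\[ \mathbb{E}_s\left[d^2(s,\hat s)\right] \le C'\left[ \sum_{j=1}^{l} L_j^{2/(2\alpha_j+1)}\tau^{2\alpha_j/(2\alpha_j+1)} + k\tau\mathcal{L} \right] \tag{5.10} \]

where $\mathcal{L} = \log(\tau^{-1}) \vee 1 \vee \bigvee_{j=1}^{l} \log(L_j^2 k^{-1})$ and $C'$ is a constant depending on $l$ 
and $\alpha_1, \ldots, \alpha_l$ only. 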

For $j \ge 1$, $R_j = R_j(g, \mathbb{F}_j, \gamma_j)$ corresponds to the risk bound for the estimation 
of the function $g_j$ alone when we use the family of models $\mathbb{F}_j$, i.e. what we would get 
if we knew $\theta_j$ and that $g_i = 0$ for all $i \ne j$. In short, $\sum_{j=1}^{l} R_j$ corresponds to the 
estimation rate of the additive function $g$. If each $g_j$ belongs to some smoothness 
class, this rate is similar to that of a real-valued function defined on the line with 
smoothness given by the worst component of $g$, as seen in (5.10). 

Proof of Corollary 3: The first part is a straightforward consequence of Theorem 6. 
For the second part, fix $s = g \circ u$ and $r = \lfloor \alpha_1 \vee \ldots \vee \alpha_l \rfloor$. Since the $g_j$ are $(\alpha_j \wedge 
1, L_j)$-Hölderian, $i(g,j,\mathcal{T}_0) \le C'\mathcal{L}$ for some $C'$ depending on the $\alpha_j$ only. By using 
Proposition 1, Lemma 1 and the collection $\mathbb{F}_j = \mathbb{H}_{1,r}$ with $\gamma_j$ defined by (5.1) for 
all $j = 1, \ldots, l$, 

\[ R_j \le C'\inf_{D \ge 1}\left\{ L_j^2 D^{-2\alpha_j} + D\tau \right\} \le C''\left( L_j^{2/(2\alpha_j+1)}\tau^{2\alpha_j/(2\alpha_j+1)} + \tau \right) \]

for some constants $C'$, $C''$ depending on the $\alpha_j$ only. Putting these bounds together, we 
end up with an estimator $\hat s_r$ the risk of which is bounded from above by the right-hand 
side of (5.10). We get the result for all values of $r$ by using Theorem 3 and 
arguing as in the proof of Theorem 4. □ 



5.3.3 Artificial neural networks 

In this section, we consider approximations of $s$ on $E = [-1,1]^k$ by functions of the 
form 

\[ \bar s(x) = \sum_{j=1}^{l} R_j\,\psi\left(\langle a_j, x\rangle + b_j\right) \quad \text{with } |b_j| + |a_j|_1 \le 2^q, \tag{5.11} \]

for given values of $(l,q) \in I = (\mathbb{N}^*)^2$. Here, $R = (R_1, \ldots, R_l) \in \mathbb{R}^l$, $a_j \in \mathbb{R}^k$, $b_j \in \mathbb{R}$ 
for $j = 1, \ldots, l$ and $\psi$ is a given uniformly continuous function on $\mathbb{R}$ with modulus of 
continuity $w_\psi$. We denote by $S_{l,q}$ the set of all functions $\bar s$ of the form (5.11). 

Let us now set $\psi_q(y) = \psi(2^q y)$ for $y \in \mathbb{R}$ and, for $x \in E$, $u_j(x) = 2^{-q}(\langle a_j, x\rangle + b_j)$, 
so that $u_j \in \mathcal{T}$ belongs to the $(k+1)$-dimensional space of functions of the form $x \mapsto 
\langle a, x\rangle + b$. We can then rewrite $\bar s$ in the form $g \circ u$ with $g(y_1, \ldots, y_l) = \sum_{j=1}^{l} R_j\psi_q(y_j)$. 
Since $g$ belongs to the $l$-dimensional linear space $F$ spanned by the functions $\psi_q(y_j)$, 
we may set $\mathbb{F} = \{F\}$, $\Delta_\gamma(F) = 0$ and apply Theorem 6. With $w_{g,j}(y) = |R_j|\,w_\psi(2^q y)$, 
(5.5) becomes, 

\[ C\,\mathbb{E}_s\left[d^2(s, \hat s_{l,q})\right] \le d^2(s, \bar s) + \tau(k+1)\sum_{j=1}^{l} \inf\left\{ i \in \mathbb{N}^* \ \middle|\ l R_j^2 w_\psi^2(2^q e^{-i}) \le (k+1)\tau i \right\}. \]
If $w_\psi(y) \le L y^\alpha$ for some $L > 0$, $0 < \alpha \le 1$ and all $y \in \mathbb{R}_+$, then, according to (4.4), 



< d\s,s)+lkr^qlogi + a~Hog^(^l\R\l^L^[kT]-^^Y (5.12) 

These bounds being valid for all $(l,q) \in I$ and $\bar s \in S_{l,q}$, we may apply Theorem 3 to 
the family of all estimators $\hat s_{l,q}$, $(l,q) \in I$, with $\nu$ given by $\nu(l,q) = e^{-l-q}$. We then 
get the following result. 

Theorem 7 Assume that $\psi$ is a continuous function with modulus of continuity 
$w_\psi(y)$ bounded by $L y^\alpha$ for some $L > 0$, $0 < \alpha \le 1$ and all $y \in \mathbb{R}_+$. Then one 
can build an estimator $\hat s = \hat s(X)$ such that 

CE,[d^ {s,s))] 

< inf inf ld^{s,s)+lkTq\l + {qa)-^log^(l\R\l^L'^[kT]'^)]} . (5.13) 

{l,q)&I SGSl^q I L V /J J 
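The rewriting of a network of the form (5.11) as a composite function $g \circ u$, on which the bounds above rely, can be verified numerically with the following Python sketch. The activation `math.tanh` and the numerical values of $R$, $a_j$, $b_j$ and $q$ are arbitrary choices made for illustration; they only need to satisfy the constraint $|b_j| + |a_j|_1 \le 2^q$.

```python
import math

def network(x, R, a, b, psi):
    """s_bar(x) = sum_j R_j * psi(<a_j, x> + b_j), as in (5.11)."""
    return sum(Rj * psi(sum(ai * xi for ai, xi in zip(aj, x)) + bj)
               for Rj, aj, bj in zip(R, a, b))

def inner_functions(x, a, b, q):
    """u_j(x) = 2^{-q} (<a_j, x> + b_j), with values in [-1, 1] on [-1, 1]^k."""
    return [2.0 ** (-q) * (sum(ai * xi for ai, xi in zip(aj, x)) + bj)
            for aj, bj in zip(a, b)]

def outer_function(y, R, psi, q):
    """g(y) = sum_j R_j * psi_q(y_j) with psi_q(y) = psi(2^q y)."""
    return sum(Rj * psi(2.0 ** q * yj) for Rj, yj in zip(R, y))

q = 2
R = [0.7, -1.3]
a = [[1.5, -1.0], [0.5, 2.0]]     # |a_1|_1 = 2.5 and |a_2|_1 = 2.5
b = [1.0, -1.5]                   # |b_j| + |a_j|_1 <= 2^q = 4 for both units
x = [0.3, -0.7]
direct = network(x, R, a, b, math.tanh)
composite = outer_function(inner_functions(x, a, b, q), R, math.tanh, q)
```

The two values coincide, and the rescaling by $2^{-q}$ is what guarantees that each $u_j$ takes its values in $[-1,1]$, the price being the factor $2^q$ inside the modulus of continuity of $g$.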



Approximation by functions of the form (5.11). Various authors have provided 
conditions on the function $s$ so that it can be approximated within $\eta$ by functions 
$\bar s$ of the form (5.11) for a given function $\psi$. An extensive list of authors and 
results is provided in Section 4.2.2 of Barron, Birgé and Massart (1999) and some 
proofs are provided in Section 8.2 of that paper. The starting point of such approximations 
is the assumed existence of a Fourier representation of $s$ of the form 

\[ s(x) = K_s \int_{\mathbb{R}^k} \cos\left(\langle a, x\rangle + \delta(a)\right) dF_s(a), \qquad K_s \in \mathbb{R}, \quad |\delta(a)| \le \pi, \]

for some probability measure $F_s$ on $\mathbb{R}^k$. To each given function $\psi$ that can be used 
for the approximation of $s$ is associated a positive number $\beta = \beta_\psi > 0$ and one has 
to assume that 

(5.14) 



J \a\idFs{a) < +oo, 



in order to control the approximation of $s$ by functions of the form (5.11). A careful 
inspection of the proof of Proposition 6 in Barron, Birgé and Massart (1999) shows 
that, when (5.14) holds, one can derive the following approximation result for $s$. 
There exist constants > 1, 7,/, > and > depending on Tp only, a number 
Rs,(3 ^ 1 depending on c^,/? only and some s G Si^g with < Rs,i3 such that 



d (s, s) < KgCj 



+ R^ J-y^ 



for q > 



(5.15) 

Putting this bound into (5.13) and omitting the various indices for simplicity, we get 
a risk bound of the form 

\[ \mathcal{R}(l,q) = C K^2 \left[ 2^{-2q\gamma} + R^2 l^{-1} + K^{-2} l k\tau q \left[ 1 + (q\alpha)^{-1}\log_+\left( l R^2 L^2 [k\tau]^{-1} \right) \right] \right], \]

to be optimized with respect to $l \ge 1$ and $q \ge q_\psi$. We shall actually perform the 
optimization with respect to the first three terms, omitting the logarithmic one. 

Let us first note that, if RK < ^Jl^kr^ one should set q = q-ip and 1 = 1, which 
leads to 

7^(l,g^) < Ckrq^ [1 + (<7^a)-i log+ {R^L^ikr]-^)] . 
Otherwise -s/q^kr < RK and we set 



q = q* = mi^q>q^ 2'2g7 < (R/K)^/^^ and 
If I* > 1, then RK{q*kT)-^/'^ < I* < 2RK{q*kT)-^/^ hence 



I = I* 



RK 

y/q*kT 



n{l*,q*) < CRK^Jq*kT 



1 + — log+ 
q*a 



2R^L'^K 
Ik^WWQ* 



(5.16) 



If I* = 1, then R^ < K '^q*kT and s/q^kr < RK < ^/q*kT, hence q* > q^ and 
q* — 1 > q* /2. Then, from the definition of q*, 



RK-^^J{q*/2)kT < RK-^^J{q* - l)kT < 2-2(9*-i)7 < 2-27^ 

hence ^/q*kT < (K/i?)2-27+(i/2) < and ^J6\i still holds. To conclude, we 

observe that either — 27g'^log2 < log [RK~^ ^y q^kr^ and q* = q^ or the solution zq 
of the equation 



2z7 log 2 = log K/ 



RVkT 



(1/2) log z 



satisfies q-,p < zq < q* . Since logzo ^ zq/c, it follows that 



/(27log2 + e-i) 



q*>log(K/ 
and, by monotonicity, that 



RVkT 



1 / 2R^LP'K 

? [jkrWW^ ) < £ = (27 log 2 + e-') log 



2R^L^K 



{kTfl^^^ 

where £ is a bounded function of kr. One can also check that 

log {K/ 



log 



K 



RVkT 



Q <q 



27 log 2 



and ()5.16p finally leads, when q* > q^tp, to 

"log (i^/[i?7^) 



'R-{l*,q*) < CRK kr 



27 log 2 



\ 1/2 



(5.17) 



In the asymptotic situation where $\tau$ converges to zero, (5.17) prevails and we get a 
risk bound of order $[-k\tau\log(k\tau)]^{1/2}$. 

5.4 Estimation of a regression function and PCA 

We consider here the regression framework 

\[ Y_i = s(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \]

where the $X_i$ are random variables with values in some known compact subset $K$ 
of $\mathbb{R}^k$ (with $k > 1$ to avoid trivialities), the $\varepsilon_i$ are i.i.d. centered random variables of 
common variance 1 for simplicity and $s$ is an unknown function from $K$ to $\mathbb{R}$. By 
a proper origin and scale change on the $X_i$, mapping $K$ into the unit ball $B_k$ of $\mathbb{R}^k$, 
one may assume that the $X_i$ belong to $B_k$, hence that $E = B_k$, which we shall do 
from now on. We also assume that the $X_i$ are either i.i.d. with common distribution 
$\mu$ on $E$ (random design) or deterministic ($X_i = x_i$, fixed design), in which case 
$\mu = n^{-1}\sum_{i=1}^{n} \delta_{x_i}$, where $\delta_x$ denotes the Dirac measure at $x$. In both cases, we choose 
for $d$ the distance in $L_2(E,\mu)$. As already mentioned in Section 2.3, Theorem 1 with 
$\tau = n^{-1}$ applies to this framework, at least in the two cases when the design is fixed 
and the errors Gaussian (or subgaussian) or when the design is random and the $Y_i$ 
are bounded, say with values in $[-1,1]$. 



5.4.1 Introducing PCA 

Our aim is to estimate $s$ from the observation of the pairs $(X_i, Y_i)$ for $i = 1, \ldots, n$, 
assuming that $s$ belongs to some smoothness class. More precisely, given $A \subset \mathbb{R}^k$ 
and some concave modulus of continuity $w$ on $\mathbb{R}_+$, we define $\mathcal{H}_w(A)$ to be the class 
of functions $h$ on $A$ such that 

\[ |h(x) - h(y)| \le w(|x - y|) \quad \text{for all } x, y \in A. \]

Here we assume that $s$ is defined on $B_k$ and belongs to $\mathcal{H}_w(B_k)$, in which case it can 
be extended to an element of $\mathcal{H}_w(\mathbb{R}^k)$, which we shall use when needed. Typically, 
if $w(z) = Lz^\alpha$ with $\alpha \in (0,1]$ and the $X_i$ are i.i.d. with uniform distribution $\mu$ 
on $E$, the minimax risk bound over $\mathcal{H}_w(B_k)$ with respect to the $L_2(E,\mu)$-loss is 
$C' L^{2k/(k+2\alpha)} n^{-2\alpha/(k+2\alpha)}$ (where $C'$ depends on $k$ and the distribution of the $\varepsilon_i$). 
It can be quite slow if $k$ is large (see Stone (1982)), although no improvement is 
possible from the minimax point of view if the distribution of the $X_i$ is uniform on 
$B_k$. Nevertheless, if the data $X_i$ were known to belong to an affine subspace $V$ of $\mathbb{R}^k$ 
the dimension $l$ of which is small as compared to $k$, so that $\mu(V) = 1$, estimating the 
function $s$ with $L_2(E,\mu)$-loss would amount to estimating $s \circ \Pi_V$ (where $\Pi_V$ denotes 
the orthogonal projector onto $V$) and one would get the much better rate $n^{-2\alpha/(l+2\alpha)}$ 
with respect to $n$ for the quadratic risk. Such a situation is seldom encountered in 
practice but we may assume that it is approximately satisfied for some well-chosen $V$. 
It therefore becomes natural to look for an affine space $V$ with dimension $l \le k$ such 
that $s$ and $s \circ \Pi_V$ are close with respect to the $L_2(E,\mu)$-distance. For $s \in \mathcal{H}_w(\mathbb{R}^k)$, 
it follows from Lemma 3 below that, 

\[ \int_E |s(x) - s \circ \Pi_V(x)|^2\,d\mu(x) \le \int_E w^2\left(|x - \Pi_V x|\right) d\mu(x) \le 2w^2\left( \left[ \int_E |x - \Pi_V x|^2\,d\mu(x) \right]^{1/2} \right), \]

and minimizing the right-hand side amounts to finding an affine space $V$ with dimension 
$l$ for which $\int_E |x - \Pi_V x|^2\,d\mu(x)$ is minimum. This way of reducing the dimension 
is usually known as PCA (for Principal Components Analysis). When the $X_i$ are 
deterministic and $\mu = n^{-1}\sum_{i=1}^{n} \delta_{x_i}$, the solution to this minimization problem is given 
by the affine space $V_l = a + W_l$ where the origin $a = \bar x_n = n^{-1}\sum_{i=1}^{n} x_i \in E$ and 
$W_l$ is the linear space generated by the eigenvectors associated to the $l$ largest eigenvalues 
(counted with their multiplicity) of $XX^*$ (where $X$ is the $k \times n$ matrix with 
columns $x_i - \bar x_n$ and $X^*$ is the transpose of $X$). In the general case, it suffices to 
set $a = \int_E x\,d\mu$ (so that $a \in E$) and replace $XX^*$ by the matrix 

\[ \Gamma = \int_E (x - a)(x - a)^*\,d\mu(x). \]

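If $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_k \ge 0$ are the eigenvalues of $\Gamma$ in nonincreasing order, then 

\[ \inf_{\{V \,|\, \dim(V) = l\}} \int_E |x - \Pi_V x|^2\,d\mu(x) = \sum_{j=l+1}^{k} \lambda_j \tag{5.18} \]

(with the convention $\sum_{\varnothing} = 0$) and therefore 

\[ \inf_{\{V \,|\, \dim(V) = l\}} \|s - s \circ \Pi_V\|_2^2 \le \|s - s \circ \Pi_{V_l}\|_2^2 \le 2w^2\left( \left[ \sum_{j=l+1}^{k} \lambda_j \right]^{1/2} \right). \tag{5.19} \]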
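Identity (5.18) can be illustrated in the simplest case $k = 2$, $l = 1$ with an empirical measure $\mu = n^{-1}\sum_i \delta_{x_i}$. The following Python sketch, which uses the closed-form eigen-decomposition of a symmetric $2 \times 2$ matrix in order to remain self-contained, computes $a$, the leading eigenvector of $\Gamma$ and the mean squared distance of the points to the line $V_1 = a + W_1$. The data points and the function names are illustrative.

```python
import math

def pca_line_2d(points):
    """Mean a, unit leading eigenvector and eigenvalues (lam1 >= lam2) of
    Gamma = n^{-1} sum_i (x_i - a)(x_i - a)^* for points in the plane."""
    n = len(points)
    ax = sum(p[0] for p in points) / n
    ay = sum(p[1] for p in points) / n
    sxx = sum((p[0] - ax) ** 2 for p in points) / n
    syy = sum((p[1] - ay) ** 2 for p in points) / n
    sxy = sum((p[0] - ax) * (p[1] - ay) for p in points) / n
    mid = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    lam1, lam2 = mid + radius, mid - radius
    if abs(sxy) > 1e-15:
        vx, vy = sxy, lam1 - sxx          # eigenvector associated with lam1
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (ax, ay), (vx / norm, vy / norm), (lam1, lam2)

def mean_sq_residual(points, origin, direction):
    """n^{-1} sum_i |x_i - Pi_V x_i|^2 for the line V = origin + span(direction)."""
    total = 0.0
    for x, y in points:
        dx, dy = x - origin[0], y - origin[1]
        t = dx * direction[0] + dy * direction[1]
        total += (dx - t * direction[0]) ** 2 + (dy - t * direction[1]) ** 2
    return total / len(points)

points = [(0.1, 0.2), (0.4, 0.1), (-0.3, -0.2), (0.5, 0.6), (-0.6, -0.5)]
a, v, (lam1, lam2) = pca_line_2d(points)
residual = mean_sq_residual(points, a, v)
```

Here `residual` equals the smallest eigenvalue `lam2`, which is (5.18) with $k = 2$ and $l = 1$, and any other direction through $a$ gives a residual at least as large.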



5.4.2 PCA and composite functions 

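In order to put the problem at hand into our framework, we have to express $s \circ \Pi_{V_l}$ in 
the form $g \circ u$. To do so we consider an orthonormal basis $\mathbf{u}_1, \ldots, \mathbf{u}_k$ of eigenvectors 
of $XX^*$ or $\Gamma$ (according to the situation) corresponding to the ordered eigenvalues 
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_k \ge 0$. For a given value of $l < k$ we denote by $a^\perp$ the component 
of $a$ which is orthogonal to the linear span $W_l$ of $\mathbf{u}_1, \ldots, \mathbf{u}_l$ and for $x \in B_k$, we define 
$u_j(x) = \langle x, \mathbf{u}_j\rangle$ for $j = 1, \ldots, l$. This results in an element $u = (u_1, \ldots, u_l)$ of $\mathcal{T}^l$ and 
$a^\perp + \sum_{j=1}^{l} u_j(x)\mathbf{u}_j = \Pi_{V_l}(x)$ is the projection of $x$ onto the affine space $V_l = a + W_l$. 
Setting 

\[ g(z) = s\left( a^\perp + \sum_{j=1}^{l} z_j \mathbf{u}_j \right) \quad \text{for } z \in [-1,1]^l \]

leads to a function $g \circ u$ with $u \in \mathcal{T}^l$ and $g \in \mathcal{F}_{l,c}$ which coincides with $s \circ \Pi_{V_l}$ on 
$B_k$ as required. Consequently, the right-hand side of (5.19) provides a bound on the 
distance between $s$ and $g \circ u$. Moreover, since $s \in \mathcal{H}_w(\mathbb{R}^k)$, 

\[ |g(z) - g(z')| \le w\left( \left| \sum_{j=1}^{l} (z_j - z'_j)\mathbf{u}_j \right| \right) = w(|z - z'|), \tag{5.20} \]

so that we may set $w_{g,j} = w$ for all $j \in \{1, \ldots, l\}$. 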

In the following sections we shall use this preliminary result in order to establish 
risk bounds for estimators $\hat s_l$ of $s$, distinguishing between the two situations where $\mu$ 
is known and $\mu$ is unknown. 

5.4.3 Case of a known $\mu$ 

For $D \in \mathbb{N}^*$, we consider the partition $\mathcal{P}_{l,D}$ of $[-1,1]^l$ into cubes with edge length 
$2/D$ and denote by $F_{l,D}$ the linear space of functions which are piecewise constant on 
each element of $\mathcal{P}_{l,D}$ so that $\mathcal{D}(F_{l,D}) = D^l$ for all $D \in \mathbb{N}^*$. This leads to the family 
$\mathbb{F} = \{F_{l,D},\ D \in \mathbb{N}^*\}$ and we set $\gamma(F_{l,D}) = e^{-D}$ for all $D \ge 1$. We define $u_j$ as in the 
previous section and take for $\mathbb{T}_j$ the family reduced to the single model $T_j = \{u_j\}$ 
for $j = 1, \ldots, l$. Then $\mathcal{D}(T_j) = 0$ for all $j$ and we take for $\lambda_j$ the Dirac measure on 
$T_j$. This leads to a set $\mathfrak{S}$ which satisfies Assumption 2 and we may therefore apply 
Theorem 2 which leads to an estimator $\hat s_l$ with a risk bounded by 

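\[ C\,\mathbb{E}_s\left[\|s - \hat s_l\|_2^2\right] \le d^2(s, g \circ u) + \inf_{D \ge 1} \left[ d_\infty^2(g, F_{l,D}) + \frac{D^l + D}{n} \right]. \]

Since $s \circ \Pi_{V_l}$ and $g \circ u$ coincide on $B_k$, it follows from (5.19) that 

\[ \|s - g \circ u\|_2^2 = \|s - s \circ \Pi_{V_l}\|_2^2 \le 2w^2\left( \left[ \sum_{j=l+1}^{k} \lambda_j \right]^{1/2} \right). \]

Moreover, for all cubes $I \in \mathcal{P}_{l,D}$ and $x \in I$, the Euclidean distance between $x$ and 
the center of $I$ is at most $\sqrt{l}\,D^{-1}$, hence by (5.20), $d_\infty(g, F_{l,D}) \le w(\sqrt{l}\,D^{-1})$ for all 
$D \ge 1$. Putting these inequalities together we see that the risk of $\hat s_l$ is bounded by 

\[ C'\,\mathbb{E}_s\left[\|s - \hat s_l\|_2^2\right] \le w^2\left( \left[ \sum_{j=l+1}^{k} \lambda_j \right]^{1/2} \right) + \inf_{D \ge 1} \left[ w^2\left(\sqrt{l}\,D^{-1}\right) + \frac{D^l}{n} \right]. \tag{5.21} \]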
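The approximation step behind (5.21), namely $d_\infty(g, F_{l,D}) \le w(\sqrt{l}\,D^{-1})$, can be checked on an example. The Python sketch below approximates a function $g$ by its value at the center of each cube of the partition and evaluates the sup-norm error on a finite grid. The test function $g(z) = L|z|^\alpha$, which satisfies $|g(z) - g(z')| \le L|z - z'|^\alpha$ for $\alpha \in (0,1]$, and the grid size are illustrative choices, and the maximum over a grid only gives a lower estimate of the true supremum.

```python
import itertools
import math

def cube_center(z, D):
    """Center of the cube of the regular partition of [-1,1]^l (edge 2/D) containing z."""
    center = []
    for zi in z:
        idx = min(int((zi + 1.0) * D / 2.0), D - 1)   # cube index along this axis
        center.append(-1.0 + (2 * idx + 1) / D)
    return center

def sup_error(g, l, D, grid=41):
    """Max over a finite grid of |g(z) - g(center of the cube containing z)|."""
    axis = [-1.0 + 2.0 * i / (grid - 1) for i in range(grid)]
    return max(abs(g(z) - g(cube_center(z, D)))
               for z in itertools.product(axis, repeat=l))

L, alpha, l, D = 1.5, 0.5, 2, 4

def g(z):
    """Hölder test function g(z) = L * |z|^alpha (Euclidean norm)."""
    return L * math.sqrt(sum(t * t for t in z)) ** alpha

error = sup_error(g, l, D)
bound = L * (math.sqrt(l) / D) ** alpha      # w(sqrt(l)/D) with w(z) = L z^alpha
```

The piecewise constant space used here has dimension $D^l$, which is the source of the variance term $D^l/n$ in (5.21) and explains why a small value of $l$ is beneficial.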



5.4.4 Case of an unknown $\mu$ 

When $\mu$ corresponds to an unknown distribution of the $X_i$, the matrix $\Gamma$ is unknown, 
and so are its eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_k$ and the vector $a$, and therefore also the elements 
$u_1, \ldots, u_l$ of $\mathcal{T}$. In order to cope with this problem, we have to approximate the 
unknown $u_j$, which requires modifying the definition of $\mathbb{T}_j$ given in the previous section, 
keeping all other things unchanged. For each $v \in \mathbb{R}^k$ with $|v| \le 1$, we denote by $t_v$ the 
linear map, element of $\mathcal{T}$, given by $t_v(x) = \langle x, v\rangle$. Denoting by $\mathcal{B}_k$ the unit sphere in 
$\mathbb{R}^k$ we then set, for all $j$, $T_j = T = \{t_v,\ v \in \mathcal{B}_k\}$ which is a subset of a $k$-dimensional 
linear subspace of $L_2(\mu)$. It follows that Assumption 2 remains satisfied but now with 
$\mathcal{D}(T_j) = k$. Since $u_j \in T_j$ for all $j$, an application of Theorem 2 leads to 



k 



CEs[d^s,s)] <-Y,K9,j,T)+d\s,g 



o u 



+ inf 

D>1 



dl,{g,Fi^D) + 



+ D 



n 



where i{g,j,T) is given by ()4.3p . Since, by (|5.20p . Wgj 

i{g,j,T)=i = mf!^i£n* 
Arguing as in the case of a known fi, we get 



\lw^ ie 



w for all j G {1, . . . , /}, 

ik 
n 



< 



CE, 



^ l|2 



< 



kli 



+ w^ 



n 




+ 



n 
log DIVI 



Let Id 



i>iz) + l>2 and 



If i < Id, then kli/n < klD/n since iD ^ D. Otherwise, 



;2„2 ( -iD\ \ 72,„2 ( ^ ^) \ 



> 



> 



n 2n 

which shows that kli/n < 2/^w^ (a//D^^ ) + klD/n. Finahy 



C7E, 



\S - Sl\\2 



< W^ 



< W^ 



+ 



2D^ 



n 



which is, up to constants, the same as (5.21). 

5.4.5 Varying $l$ 

The previous bounds are valid for all values of $l \in I = \{1, \ldots, k\}$ but we do not know 
which value of $l$ will lead to the best estimator. We may therefore apply Theorem 3 
with $\nu(l) = l^{-2}/2$ for $l \in I$, which leads to the following risk bound for the new 
estimator $\hat s$ in the case of a known $\mu$: 



CE, 



\s — s\ 



< inf inf 

ie{i,...,k} D>i 



+ log / 



n 



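Apart from multiplicative constants depending only on $k$, the same result holds when 
$\mu$ is unknown. If $w(z) = Lz^\alpha$ for some $L > 0$ and $\alpha \in (0,1]$, we get, since $\sum_{j=l+1}^{k} 
\lambda_j \le (k - l)\lambda_{l+1}$ (with the convention $\lambda_{k+1} = 0$), 

\[ C\,\mathbb{E}_s\left[\|s - \hat s\|_2^2\right] \le \inf_{l \in \{1, \ldots, k\}}\ \inf_{D \ge 1} \left\{ L^2\left[(k - l)\lambda_{l+1}\right]^\alpha + L^2 l^\alpha D^{-2\alpha} + \frac{D^l + \log l}{n} \right\}. \]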



Assuming that n > L ^ to avoid trivialities and choosing D 
finally get 



2,„U/(«+2q) 



we 



CE, 



s — s 



ie{i,...M I ^ ' n n2«/('+2") | 



For $l = k$, we recover (up to constants) the minimax risk bound over $\mathcal{H}_w(B_k)$, namely 
$C''(k) L^{2k/(k+2\alpha)} n^{-2\alpha/(k+2\alpha)}$. Therefore our procedure can only improve the risk as 
compared to the minimax approach. 



5.5 Introducing parametric models 

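In this section, we approximate $s$ by functions of the form $\bar s = g \circ u$ where $g$ belongs to 
$\mathcal{F}_{l,c}$ and the components $u_j$ of $u$ to parametric models $T_j = \{u_j(\theta, \cdot),\ \theta \in \Theta_j\} \subset \mathcal{T}$ 
indexed by subsets $\Theta_j$ of $\mathbb{R}^{k_j}$ with $k_j \ge 1$. Besides, we assume that the following 
holds. 

Assumption 3 For each $j = 1, \ldots, l$, $\Theta_j \subset \mathcal{B}_{k_j}(0, M_j)$, the Euclidean ball of $\mathbb{R}^{k_j}$ centered 
at 0 with radius $M_j$, for some positive number $M_j$, 
and the mapping $\theta \mapsto u_j(\theta, \cdot)$ from $\Theta_j$ to $(\mathcal{T}, d)$ is $(\beta_j, R_j)$-Hölderian for $\beta_j \in (0,1]$ 
and $R_j > 0$, which means that 

\[ d\left(u_j(\theta, \cdot), u_j(\theta', \cdot)\right) \le R_j\,|\theta - \theta'|^{\beta_j} \quad \text{for all } \theta, \theta' \in \Theta_j. \tag{5.22} \]

Under such an assumption, the following result holds. 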



(5.22) 



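Theorem 8 Let $l \ge 1$, $T_1,\dots,T_l$ be parametric sets satisfying Assumption 3, $\mathbb{F}$ be a
collection of models satisfying Assumption 2-i) and $\gamma$ be a subprobability on $\mathbb{F}$. There
exists an estimator $\hat s$ such that

$$C\,\mathbb{E}_s\big[d^2(s,\hat s)\big] \le d^2(s, g\circ u) + \inf_{F\in\mathbb{F}}\big[d_\infty^2(g,F) + \tau\big(\Delta_\gamma(F)+\mathcal{D}(F)\big)\big] + \tau\sum_{j=1}^{l} k_j\log\big(1+2M_jR_j^{1/\beta_j}\big) + \sum_{j=1}^{l}\inf_{i\ge 1}\big[l\,w_{g,j}^2(e^{-i}) + i\tau\big(1+k_j\beta_j^{-1}\big)\big]$$

for all $g \in \mathcal{F}_{l,c}$ and $u_j \in T_j$, $j = 1,\dots,l$.

In particular, for all $(\alpha, L)$-Hölderian functions $g$ with $\alpha \in (0,1]^l$ and $L \in (\mathbb{R}_+^*)^l$,

$$C\,\mathbb{E}_s\big[d^2(s,\hat s)\big] \le d^2(s, g\circ u) + \inf_{F\in\mathbb{F}}\big[d_\infty^2(g,F) + \tau\big(\Delta_\gamma(F)+\mathcal{D}(F)\big)\big] + \tau\sum_{j=1}^{l}\Big[k_j\log\big(1+2M_jR_j^{1/\beta_j}\big) + (\mathcal{L}_j\vee 1)\big(1+k_j\beta_j^{-1}\big)\Big], \tag{5.23}$$

where

$$\mathcal{L}_j = \frac{1}{2\alpha_j}\log\left(\frac{l L_j^2}{\tau\big[1+k_j\beta_j^{-1}\big]}\right) \quad\text{for } j = 1,\dots,l. \tag{5.24}$$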



Although this theorem is stated for a given value of $l$, we may, arguing as before, let $l$
vary and design a new estimator which achieves the same risk bounds (apart from the
constant $C$) whatever the value of $l$.

As usual, the quantity $\inf_{F\in\mathbb{F}}\big[d_\infty^2(g,F)+\tau\big(\Delta_\gamma(F)+\mathcal{D}(F)\big)\big]$ corresponds to the
estimation rate for the function $g$ alone by using the collection $\mathbb{F}$. In particular, if
$g \in \mathcal{H}^{\alpha}([-1,1]^l)$ with $\alpha \in (\mathbb{R}_+^*)^l$, this bound is of order $\tau^{2\overline{\alpha}/(2\overline{\alpha}+l)}$ as $\tau$ tends to 0 for
a classical choice of $\mathbb{F}$ (see Section [3?!]). Since for all $j$, $g$ is also $(\alpha_j\wedge 1)$-Hölderian
as a function of $x_j$ alone, the last term in the right-hand side of (5.23), which is of
order $-\tau\log\tau$, becomes negligible as compared to $\tau^{2\overline{\alpha}/(2\overline{\alpha}+l)}$ and therefore, when $s$ is
really of the form $g\circ u$ with $g \in \mathcal{H}^{\alpha}([-1,1]^l)$, the rate we get for estimating $s$ is the
same as that for estimating $g$.

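Proof of Theorem 8: For $\eta > 0$ and $j = 1,\dots,l$, let $\Theta_j[\eta]$ be a maximal subset of $\Theta_j$
satisfying $|t-t'| > \eta$ for all distinct $t, t'$ in $\Theta_j[\eta]$. Since $\Theta_j$ is a subset of the Euclidean ball
in $\mathbb{R}^{k_j}$ centered at 0 with radius $M_j$, it follows from classical entropy computations
(see Lemma 4 in Birgé (2006)) that $\log|\Theta_j[\eta]| \le k_j\log(1+2M_j\eta^{-1})$. For all $i \in \mathbb{N}^*$,
let $T_{j,i}$ be the image of $\Theta_{j,i} = \Theta_j\big[(R_je^{i})^{-1/\beta_j}\big]$ by the mapping $\theta\mapsto u_j(\theta,\cdot)$. Clearly,

$$\log|T_{j,i}| \le \log|\Theta_{j,i}| \le k_j\log\big(1+2M_jR_j^{1/\beta_j}e^{i/\beta_j}\big) \le k_j\log\big(1+2M_jR_j^{1/\beta_j}\big) + i\,k_j\beta_j^{-1}$$

and because of the maximality of $\Theta_{j,i}$ and (5.22), for all $\theta\in\Theta_j$ there exists $\overline\theta\in\Theta_{j,i}$
such that $d\big(u_j(\theta,\cdot),u_j(\overline\theta,\cdot)\big) \le R_j|\theta-\overline\theta|^{\beta_j} \le e^{-i}$, so that $T_{j,i}$ is an $e^{-i}$-net for $T_j$. For
$j = 1,\dots,l$, we set $\mathbb{T}_j = \bigcup_{i\ge1}\mathbb{T}_{j,i}$ with $\mathbb{T}_{j,i} = \{\{t\},\ t\in T_{j,i}\}$, so that the models in $\mathbb{T}_j$ are merely the elements of
the sets $T_{j,i}$. For a model $T$ that belongs to $\mathbb{T}_{j,i}\setminus\bigcup_{1\le i'<i}\mathbb{T}_{j,i'}$ (with the convention
$\bigcup_{\emptyset} = \emptyset$) we set

$$\Delta_{\lambda_j}(T) = \log|T_{j,i}| + i \le k_j\log\big(1+2M_jR_j^{1/\beta_j}\big) + i\big(1+k_j\beta_j^{-1}\big),$$

which defines a measure $\lambda_j$ on $\mathbb{T}_j$ satisfying

$$\sum_{T\in\mathbb{T}_j}\lambda_j(T) \le \sum_{i\ge1}\sum_{t\in T_{j,i}}|T_{j,i}|^{-1}e^{-i} = \sum_{i\ge1}e^{-i} < 1.$$

Since for all $j$ and $T\in\mathbb{T}_j$, $\mathcal{D}(T) = 0$, we get the first risk bound by applying
Theorem 2 to the corresponding families $\mathbb{T}_j$. To prove (5.23), let us set $i_j = \lfloor\mathcal{L}_j\rfloor\vee1$
for $j = 1,\dots,l$ with $\mathcal{L}_j$ given by (5.24), so that $1\le i_j\le\mathcal{L}_j\vee1$, and notice that, if
$i\ge\mathcal{L}_j\vee1$, then $lL_j^2e^{-2\alpha_ji}\le i\tau\big(1+k_j\beta_j^{-1}\big)$. If $\mathcal{L}_j\ge1$, then $\mathcal{L}_j < i_j+1\le2\mathcal{L}_j$,
hence

$$lL_j^2e^{-2\alpha_j(i_j+1)} \le (i_j+1)\tau\big(1+k_j\beta_j^{-1}\big) \le 2\mathcal{L}_j\tau\big(1+k_j\beta_j^{-1}\big)$$

and

$$l\,w_{g,j}^2(e^{-i_j}) \le lL_j^2e^{-2\alpha_ji_j} \le 2e^{2\alpha_j}\mathcal{L}_j\tau\big(1+k_j\beta_j^{-1}\big) \le 2e^{2}\mathcal{L}_j\tau\big(1+k_j\beta_j^{-1}\big).$$

Otherwise, $\mathcal{L}_j < 1$, $i_j = 1\ge\mathcal{L}_j\vee1$ and

$$l\,w_{g,j}^2(e^{-i_j}) \le lL_j^2e^{-2\alpha_j} \le \tau\big(1+k_j\beta_j^{-1}\big),$$

so that in both cases $l\,w_{g,j}^2(e^{-i_j}) \le 2e^{2}(\mathcal{L}_j\vee1)\tau\big(1+k_j\beta_j^{-1}\big)$, which leads to the
conclusion. □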
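The discretization step used in this proof can be illustrated numerically. The following is only a sketch under stated assumptions: the parameter set is replaced by a finite random sample of the ball of radius $M$ in $\mathbb{R}^2$, and the maximal $\eta$-separated subset is built greedily. The helper names (`sample_ball`, `greedy_separated_subset`) and the numerical values are hypothetical and do not come from the paper.

```python
import math
import random


def sample_ball(k, radius, n, rng):
    """Draw n points uniformly in the Euclidean ball of R^k (rejection sampling)."""
    points = []
    while len(points) < n:
        p = tuple(rng.uniform(-radius, radius) for _ in range(k))
        if math.hypot(*p) <= radius:
            points.append(p)
    return points


def greedy_separated_subset(points, eta):
    """Keep a point only if it is at distance > eta from every point already kept.

    The result is eta-separated and maximal within `points`: every discarded
    point lies within distance eta of some kept point.
    """
    kept = []
    for p in points:
        if all(math.dist(p, q) > eta for q in kept):
            kept.append(p)
    return kept


rng = random.Random(0)
k, M, eta = 2, 1.0, 0.25          # illustrative values (assumptions)
theta = sample_ball(k, M, 2000, rng)
net = greedy_separated_subset(theta, eta)

# Largest distance from a sampled parameter to the net (should not exceed eta).
covering_radius = max(min(math.dist(p, q) for q in net) for p in theta)
# Entropy bound quoted in the proof: log|Theta[eta]| <= k * log(1 + 2M / eta).
entropy_bound = k * math.log(1.0 + 2.0 * M / eta)
```

Maximality is what turns the separated set into a net: any point left out must be within $\eta$ of a kept point, which is the property the proof transfers to the sets $T_{j,i}$ through (5.22).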

5.5.1 Estimating a density by a mixture of Gaussian densities 

In this section, we consider the problem of estimating a bounded density $s$ with respect
to some probability $\mu$ (to be specified later) on $E = \mathbb{R}^k$, $d$ denoting, as before, the
$\mathbb{L}_2$-distance on $\mathbb{L}_2(E,\mu)$. We recall from Section 2.3 that Theorem 1 applies to this
situation with $\tau = n^{-1}\|s\|_\infty(1\vee\log\|s\|_\infty)$. A common way of modeling a density on
$E = \mathbb{R}^k$ is to assume that it is a mixture of Gaussian densities (or close enough to
it). More precisely, we wish to approximate $s$ by functions $\overline{s}$ of the form

$$\overline{s}(x) = \sum_{j=1}^{l} q_j\,p(m_j,\Sigma_j,x) \quad\text{for all } x\in\mathbb{R}^k, \tag{5.25}$$

where $l\ge1$, $q = (q_1,\dots,q_l)\in[0,1]^l$ satisfies $\sum_{j=1}^{l}q_j = 1$ and, for $j = 1,\dots,l$,
$p(m_j,\Sigma_j,\cdot) = d\mathcal{N}(m_j,\Sigma_j^2)/d\mu$ denotes the density (with respect to $\mu$) of the Gaussian
distribution $\mathcal{N}(m_j,\Sigma_j^2)$ centered at $m_j\in\mathbb{R}^k$ with covariance matrix $\Sigma_j^2$ for some
symmetric positive definite matrix $\Sigma_j$. Throughout this section, we shall restrict
to means $m_j$ with Euclidean norms not larger than some positive number $r$ and to
matrices $\Sigma_j$ with eigenvalues $\rho$ satisfying $\underline{\rho}\le\rho\le\overline{\rho}$ for positive numbers $\underline{\rho}<\overline{\rho}$. In
order to parametrize the corresponding densities, we introduce the set $\Theta$ gathering
the elements $\theta$ of the form $\theta = (m,\Sigma)$ where $\Sigma$ is a positive symmetric matrix with
eigenvalues in $[\underline{\rho},\overline{\rho}]$ and $m\in B_k(0,r)$. We shall consider $\Theta$ as a subset of $\mathbb{R}^{k(k+1)}$
endowed with the Euclidean distance. In particular, the set $\mathbb{M}_k$ of square $k\times k$
matrices is identified to $\mathbb{R}^{k^2}$ and endowed with the Euclidean distance
and the corresponding norm $N$ defined by

$$N^2(A) = \sum_{i=1}^{k}\sum_{j=1}^{k}A_{i,j}^2 \quad\text{if } A = (A_{i,j})_{1\le i,j\le k}.$$



This norm derives from the inner product $[A,B] = \mathrm{tr}(AB^*)$ (where $B^*$ denotes the
transpose of $B$) on $\mathbb{M}_k$ and satisfies $N(AB)\le N(A)N(B)$ (by Cauchy-Schwarz
inequality) and $N(A) = N(UAU^{-1})$ for all orthogonal matrices $U$. In particular, if $A$
is symmetric and positive with eigenvalues bounded from above by $c$, $N(A)\le\sqrt{k}\,c$.
We shall use these properties later on. For $b = r^2/(2\overline{\rho}^2) + k\log(\sqrt{2}\,\overline{\rho}/\underline{\rho})$ and $\mu$ the
Gaussian distribution $\mathcal{N}(0,2\overline{\rho}^2I_k)$ on $\mathbb{R}^k$ (where $I_k$ denotes the identity matrix) we
define the parametric set $T$ by

$$T = \big\{u(\theta,\cdot) = e^{-b/2}\sqrt{p(\theta,\cdot)},\ \theta\in\Theta\big\}.$$



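For parameters $\theta_1 = (m_1,\Sigma_1),\dots,\theta_l = (m_l,\Sigma_l)$ in $\Theta$, the density $\overline{s}$ can be viewed as
a composite function $g\circ u$ with

$$g(y_1,\dots,y_l) = q_1e^{b}y_1^2+\dots+q_le^{b}y_l^2 \tag{5.26}$$

and $u = (u_1,\dots,u_l)$ with $u_j(\cdot) = u(\theta_j,\cdot)$ for $j = 1,\dots,l$. With our choices of $b$ and
$\mu$, $u(\theta,\cdot)\in\mathcal{T}$ for all $\theta = (m,\Sigma)\in\Theta$ as required, since for all $x\in E$

$$p(\theta,x) = \frac{(2\overline{\rho}^2)^{k/2}}{\det\Sigma}\exp\left[-\frac{|\Sigma^{-1}(x-m)|^2}{2}+\frac{|x|^2}{4\overline{\rho}^2}\right] \le \big(2\overline{\rho}^2\underline{\rho}^{-2}\big)^{k/2}\exp\left[\frac{|x-m|^2+|m|^2}{2\overline{\rho}^2}-\frac{|\Sigma^{-1}(x-m)|^2}{2}\right] \le \big(2\overline{\rho}^2\underline{\rho}^{-2}\big)^{k/2}e^{r^2/(2\overline{\rho}^2)} = e^{b}.$$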

An application of Theorem [8] leads to the following result. 

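Corollary 4 Let $s$ be a bounded density in $\mathbb{L}_2(E,\mu)$, $\tau = n^{-1}\|s\|_\infty(1\vee\log\|s\|_\infty)$,
$M = \sqrt{k}\,\overline{\rho}+r$, $b = r^2/(2\overline{\rho}^2)+k\log(\sqrt{2}\,\overline{\rho}/\underline{\rho})$, $R = \sqrt{k/2}\,e^{-b/2}\underline{\rho}^{-1}$ and

$$\mathcal{L}(\tau) = \frac{1}{2}\log\left(\frac{4le^{2b}}{\tau\,[1+k(k+1)]}\right).$$

There exists an estimator $\hat s$ satisfying for some universal constant $C > 0$

$$C\,\mathbb{E}_s\big[d^2(s,\hat s)\big] \le \inf\big[d^2(s,g\circ u)\big] + lk(k+1)\tau\big[\log(1+2MR)+(\mathcal{L}(\tau)\vee1)\big], \tag{5.27}$$

where the infimum runs among all functions $u = (u_1,\dots,u_l)\in T^l$ and $g$ of the
form (5.26).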
The second term in the right-hand side of (5.27) does not depend on $g$ nor $u$ and is
of order $-\tau\log\tau$ as $\tau$ tends to 0. As already mentioned, one can also consider many
values of $l$ simultaneously and find the best one by using Theorem 3. Up to a possibly
different constant $C$, the risk of the resulting estimator then satisfies (5.27) for all
$l\ge1$ simultaneously. The problem of estimating the parameters involved in a mixture
of Gaussian densities in $\mathbb{R}^k$ has also been considered by Maugis and Michel (2010).
Their approach is based on model selection among a family of parametric models
consisting of densities of the form (5.25). Nevertheless, they restrict to Gaussian
densities with specific forms of covariance matrices only.
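The representation of a Gaussian mixture as a composite function $g\circ u$ can be checked numerically in dimension $k = 1$. This is a minimal sketch, assuming illustrative values for $r$, the eigenvalue bounds and the mixture parameters; the function names are hypothetical. It evaluates $p(\theta,x)$ as the density ratio of $\mathcal{N}(m,\sigma^2)$ with respect to $\mu = \mathcal{N}(0,2\overline{\rho}^2)$, then verifies on a grid that $u(\theta,\cdot)\le1$ and that $g\circ u$ coincides with the mixture (5.25).

```python
import math


def density_ratio(m, sigma, x, rho_bar):
    """p(theta, x) for k = 1: density of N(m, sigma^2) with respect to N(0, 2 * rho_bar^2)."""
    return (math.sqrt(2.0) * rho_bar / sigma) * math.exp(
        -((x - m) ** 2) / (2.0 * sigma ** 2) + x ** 2 / (4.0 * rho_bar ** 2)
    )


def make_u(b, rho_bar):
    """u(theta, .) = exp(-b/2) * sqrt(p(theta, .))."""
    def u(theta, x):
        m, sigma = theta
        return math.exp(-b / 2.0) * math.sqrt(density_ratio(m, sigma, x, rho_bar))
    return u


def make_g(q, b):
    """g(y_1, ..., y_l) = sum_j q_j * exp(b) * y_j^2, as in (5.26)."""
    def g(y):
        return sum(qj * math.exp(b) * yj ** 2 for qj, yj in zip(q, y))
    return g


# Illustrative values (assumptions): k = 1, r = 1, eigenvalue range [0.5, 2].
r, rho_low, rho_bar = 1.0, 0.5, 2.0
b = r ** 2 / (2.0 * rho_bar ** 2) + math.log(math.sqrt(2.0) * rho_bar / rho_low)
thetas = [(-0.8, 0.5), (0.3, 1.2), (1.0, 2.0)]   # (mean, standard deviation)
q = [0.2, 0.5, 0.3]

u, g = make_u(b, rho_bar), make_g(q, b)
grid = [-10.0 + 0.01 * i for i in range(2001)]
max_u = max(u(th, x) for th in thetas for x in grid)
max_gap = max(
    abs(g([u(th, x) for th in thetas])
        - sum(qj * density_ratio(m, s, x, rho_bar) for qj, (m, s) in zip(q, thetas)))
    for x in grid
)
```

The constant $b$ only serves to rescale $\sqrt{p(\theta,\cdot)}$ into $[0,1]$; the factor $e^{b}$ in $g$ undoes this rescaling.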

Proof of Corollary 4: First note that for all $\theta\in\Theta$, $|\theta|\le|m|+N(\Sigma)\le r+\sqrt{k}\,\overline{\rho}$.
Hence, if we can prove that for all $\theta_0 = (m_0,\Sigma_0)$, $\theta_1 = (m_1,\Sigma_1)$ in $\Theta$

$$d\big(u(\theta_0,\cdot),u(\theta_1,\cdot)\big) \le \sqrt{\frac{k}{2}}\,\frac{e^{-b/2}}{\underline{\rho}}\,|\theta_0-\theta_1|, \tag{5.28}$$

Assumption 3 will be satisfied with

$$M_j = M = r+\sqrt{k}\,\overline{\rho} \quad\text{and}\quad R_j = \sqrt{k/2}\;e^{-b/2}\underline{\rho}^{-1} = R \quad\text{for } j = 1,\dots,l.$$

We shall therefore be able to apply Theorem 8 with $T_j = T$ for all $j$, $\tau = n^{-1}\|s\|_\infty(1\vee\log\|s\|_\infty)$,
$\mathbb{F} = \{F\}$ where $F$ is the linear span of dimension $\mathcal{D}(F) = l$ of functions $g$ of
the form (5.26) and $\gamma$ the Dirac mass at $F$. Since the functions $g$ of the form (5.26) are
$L$-Lipschitz with $L_j = 2q_je^{b}\le2e^{b}$ for all $j$, we shall finally deduce (5.27) from (5.23).
We therefore only have to prove (5.28). Let us first note that

$$d^2\big(u(\theta_0,\cdot),u(\theta_1,\cdot)\big) = 2e^{-b}h^2\big(\mathcal{N}(m_0,\Sigma_0^2),\mathcal{N}(m_1,\Sigma_1^2)\big), \tag{5.29}$$

where $h$ denotes the Hellinger distance defined by (1.1). Some classical calculations
show that

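$$h^2\big(\mathcal{N}(m_0,\Sigma_0^2),\mathcal{N}(m_1,\Sigma_1^2)\big) = 1-\frac{\exp\big[-\frac{1}{4}\big\langle m_1-m_0,(\Sigma_0^2+\Sigma_1^2)^{-1}(m_1-m_0)\big\rangle\big]}{\sqrt{\det\big(\frac{\Sigma_0^{-1}\Sigma_1+\Sigma_0\Sigma_1^{-1}}{2}\big)}}$$

and from the inequalities $1-e^{-z}\le z$ and $\log(\det A)\le\mathrm{tr}(A-I_k)$, which hold for all
$z\in\mathbb{R}$ and all matrices $A$ such that $\det A > 0$, by setting $\Sigma^2 = \Sigma_0^2+\Sigma_1^2$ we deduce
that

$$\begin{aligned}
4h^2\big(\mathcal{N}(m_0,\Sigma_0^2),\mathcal{N}(m_1,\Sigma_1^2)\big)
&\le 2\log\det\left(\frac{\Sigma_0^{-1}\Sigma_1+\Sigma_0\Sigma_1^{-1}}{2}\right) + \big\langle m_1-m_0,\Sigma^{-2}(m_1-m_0)\big\rangle\\
&\le \mathrm{tr}\big(\Sigma_0^{-1}\Sigma_1+\Sigma_0\Sigma_1^{-1}-2I_k\big) + \big\langle m_1-m_0,\Sigma^{-2}(m_1-m_0)\big\rangle\\
&= \mathrm{tr}\big((\Sigma_0-\Sigma_1)\Sigma_0^{-1}(\Sigma_0-\Sigma_1)\Sigma_1^{-1}\big) + \big\langle m_1-m_0,\Sigma^{-2}(m_1-m_0)\big\rangle = U_1+U_2,
\end{aligned}$$

with

$$U_1 = \mathrm{tr}\big((\Sigma_0-\Sigma_1)\Sigma_0^{-1}(\Sigma_0-\Sigma_1)\Sigma_1^{-1}\big) \quad\text{and}\quad U_2 = \big\langle m_1-m_0,\Sigma^{-2}(m_1-m_0)\big\rangle.$$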
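It remains to bound $U_1$ and $U_2$ from above. For $U_1$, taking $A = (\Sigma_0-\Sigma_1)\Sigma_0^{-1}$ and
$B = \Sigma_1^{-1}(\Sigma_0-\Sigma_1)$ and using the fact that the eigenvalues of $\Sigma_0^{-1}$ and $\Sigma_1^{-1}$ are not
larger than $\underline{\rho}^{-1}$, we get

$$U_1 = [A,B] \le N(A)N(B) = N\big((\Sigma_0-\Sigma_1)\Sigma_0^{-1}\big)N\big(\Sigma_1^{-1}(\Sigma_0-\Sigma_1)\big) \le N(\Sigma_0^{-1})N(\Sigma_1^{-1})N^2(\Sigma_0-\Sigma_1) \le \frac{k}{\underline{\rho}^2}N^2(\Sigma_0-\Sigma_1).$$

Let us now turn to $U_2$. It follows from the same arguments that the symmetric matrix
$\Sigma^2 = \Sigma_0^2+\Sigma_1^2$ satisfies for all $x\in\mathbb{R}^k$,

$$\langle\Sigma^2x,x\rangle = |\Sigma_0x|^2+|\Sigma_1x|^2 \ge 2\underline{\rho}^2|x|^2,$$

hence

$$U_2 = \big\langle m_1-m_0,\Sigma^{-2}(m_1-m_0)\big\rangle \le \frac{|m_1-m_0|^2}{2\underline{\rho}^2}.$$

Putting these bounds together, we obtain that

$$4h^2\big(\mathcal{N}(m_0,\Sigma_0^2),\mathcal{N}(m_1,\Sigma_1^2)\big) \le \frac{k}{\underline{\rho}^2}\big(N^2(\Sigma_0-\Sigma_1)+|m_0-m_1|^2\big) = \frac{k}{\underline{\rho}^2}|\theta_0-\theta_1|^2,$$

which, together with (5.29), leads to (5.28). □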
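The final inequality of this proof can be sanity-checked in dimension $k = 1$, where the Hellinger distance between Gaussian distributions has a simple closed form. This is a sketch under assumptions: the parameter ranges are illustrative, the function names are hypothetical, and the closed form is compared with a plain Riemann sum rather than with any routine from the paper.

```python
import math
import random


def hellinger_sq_gauss_1d(m0, s0, m1, s1):
    """Closed form of h^2(N(m0, s0^2), N(m1, s1^2)) for k = 1."""
    affinity = math.sqrt(2.0 * s0 * s1 / (s0 ** 2 + s1 ** 2)) * math.exp(
        -((m1 - m0) ** 2) / (4.0 * (s0 ** 2 + s1 ** 2))
    )
    return 1.0 - affinity


def hellinger_sq_numeric(m0, s0, m1, s1, lo=-40.0, hi=40.0, n=80000):
    """Riemann-sum approximation of (1/2) * int (sqrt f - sqrt g)^2 dx."""
    def pdf(x, m, s):
        return math.exp(-((x - m) ** 2) / (2.0 * s ** 2)) / (s * math.sqrt(2.0 * math.pi))
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        total += (math.sqrt(pdf(x, m0, s0)) - math.sqrt(pdf(x, m1, s1))) ** 2
    return 0.5 * total * step


def lipschitz_slack(theta0, theta1, rho_low, k=1):
    """(k / rho_low^2) * |theta0 - theta1|^2 - 4 h^2, which the proof shows is >= 0."""
    (m0, s0), (m1, s1) = theta0, theta1
    dist_sq = (m0 - m1) ** 2 + (s0 - s1) ** 2
    return (k / rho_low ** 2) * dist_sq - 4.0 * hellinger_sq_gauss_1d(m0, s0, m1, s1)


rng = random.Random(1)
r, rho_low, rho_bar = 1.0, 0.5, 2.0     # illustrative values (assumptions)


def draw():
    """Random parameter (mean, standard deviation) in the 1-d analogue of Theta."""
    return (rng.uniform(-r, r), rng.uniform(rho_low, rho_bar))


min_slack = min(lipschitz_slack(draw(), draw(), rho_low) for _ in range(5000))
closed = hellinger_sq_gauss_1d(0.0, 1.0, 0.5, 1.5)
numeric = hellinger_sq_numeric(0.0, 1.0, 0.5, 1.5)
```

A nonnegative `min_slack` over the random draws is consistent with the Lipschitz property (5.28); it is of course not a proof.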

6 Proof of the main results 

Let us recall that, in this section, $d$ denotes the distance associated to the $\|\cdot\|_q$ norm
of $\mathbb{L}_q(E,\mu)$ and $d_\infty$ the distance associated to the supnorm on $\mathcal{F}_{l,\infty}$.

6.1 Basic theorem 

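We shall first prove a general theorem of independent interest that applies to finite
models $T$ for functions in $\mathcal{T}^l$ and is at the core of all further developments.

Theorem 9 Let $I$ be a countable set and $\nu$ a subprobability on $I$. Assume that, for
each $\ell\in I$, we are given two countable families $\mathbb{T}_\ell$ and $\mathbb{F}_\ell$ of subsets of $\mathcal{T}^l$ and $\mathcal{F}_{l,\infty}$
respectively such that each element $T$ of $\mathbb{T}_\ell$ is finite and each $F\in\mathbb{F}_\ell$ is a linear
subspace of dimension $\mathcal{D}(F)\ge1$ of $\mathcal{F}_{l,\infty}$. Let $\lambda_\ell$ and $\gamma_\ell$ be subprobabilities on $\mathbb{T}_\ell$ and
$\mathbb{F}_\ell$ respectively. One can design an estimator $\hat s = \hat s(X)$ satisfying, for all $\ell\in I$, all
$u\in\mathcal{T}^l$ and $g\in\mathcal{F}_{l,c}$ with modulus of continuity $w_g$,

$$C\,\mathbb{E}_s\big[d^2(s,\hat s)\big] \le \inf_{T\in\mathbb{T}_\ell}\Big\{l\inf_{t\in T}\sum_{j=1}^{l}w_{g,j}^2\big(\|u_j-t_j\|_q\big) + \tau\big[\Delta_{\lambda_\ell}(T)+\log|T|+\Delta_\nu(\ell)\big]\Big\} + d^2(s,g\circ u) + \inf_{F\in\mathbb{F}_\ell}\big\{d_\infty^2(g,F)+\tau\big[\mathcal{D}(F)+\Delta_{\gamma_\ell}(F)\big]\big\}.$$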



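Proof: For each $\ell\in I$, $t\in\bigcup_{T\in\mathbb{T}_\ell}T$ and $F\in\mathbb{F}_\ell$ we consider the set $F_t = \{f\circ t,\ f\in F\}\subset\mathbb{L}_q(E,\mu)$,
which is a $\mathcal{D}(F)$-dimensional linear space. This leads to a new countable
family of models $\mathbb{S}_\ell$ together with a subprobability $\pi_\ell$ on $\mathbb{S}_\ell$ given by

$$\mathbb{S}_\ell = \Big\{F_t,\ t\in\bigcup_{T\in\mathbb{T}_\ell}T,\ F\in\mathbb{F}_\ell\Big\};\qquad \pi_\ell(F_t) = \gamma_\ell(F)\sup_{T\in\mathbb{T}_\ell,\,T\ni t}|T|^{-1}\lambda_\ell(T). \tag{6.1}$$

We then set

$$\mathbb{S} = \bigcup_{\ell\in I}\mathbb{S}_\ell \quad\text{and}\quad \pi(F_t) = \nu(\ell)\pi_\ell(F_t) \quad\text{for } F_t\in\mathbb{S}_\ell.$$

It follows that

$$\Delta_\pi(F_t) = \Delta_{\gamma_\ell}(F) + \inf_{T\in\mathbb{T}_\ell,\,T\ni t}\big[\Delta_{\lambda_\ell}(T)+\log(|T|)\big] + \Delta_\nu(\ell) \quad\text{for } F_t\in\mathbb{S}_\ell.$$

Applying Theorem 1 to $\mathbb{S}$ and $\pi$ leads to an estimator $\hat s$ satisfying, for each $\ell\in I$,

$$C\,\mathbb{E}_s\big[d^2(s,\hat s)\big] \le \inf_{F_t\in\mathbb{S}_\ell}\Big\{\inf_{f\in F}d^2(s,f\circ t)+\tau\big[\mathcal{D}(F)+\Delta_\pi(F_t)\big]\Big\}.$$

We now use Proposition 3 which implies that, for each $f\circ t\in F_t$,

$$\begin{aligned}
d^2(s,f\circ t) &\le \big(\|s-g\circ u\|_q+\|g\circ u-f\circ t\|_q\big)^2\\
&\le \Big(\|s-g\circ u\|_q+d_\infty(g,f)+2^{1/q}\sum_{j=1}^{l}w_{g,j}\big(\|u_j-t_j\|_q\big)\Big)^2\\
&\le 3\Big(\|s-g\circ u\|_q^2+d_\infty^2(g,f)+4l\sum_{j=1}^{l}w_{g,j}^2\big(\|u_j-t_j\|_q\big)\Big),
\end{aligned}$$

since $2^{1/q}\le2$. The conclusion follows, for a suitable universal constant $C$, from a
minimization over all possible choices for $f$ and $t$.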

6.2 Building new models 

In order to use Theorem 9 which applies to finite sets $T$, starting from the models
$T$ which satisfy Assumption 2, we need to derive new models from the original ones.
Let us first observe that, since $u_j$ takes its values in $[-1,1]$ and $\mu$ is a probability on
$E$, $d(0,u_j)\le1$. It is consequently useless to try to approximate $u_j$ by elements of
$\mathbb{L}_q(E,\mu)$ that do not belong to $B(0,2)$ since 0 always does better. We may therefore
replace $T\subset\mathbb{L}_q(E,\mu)$ by $(T\cap B(0,2))\cup\{0\}$, denoting again the resulting set, which
remains a subset of some $\mathcal{D}(T)$-dimensional linear space, by $T$. Moreover, this
modification can only decrease the value of $d(T,u_j)$. Since now $T\subset B(0,2)$, we can use
the discretization argument described by the following lemma.

Lemma 3 Let $T\subset B(0,2)$ be either a singleton (in which case $\mathcal{D}(T) = 0$) or a
subset of some $\mathcal{D}(T)$-dimensional linear subspace of $\mathbb{L}_q(E,\mu)$ with $\mathcal{D}(T)\ge1$. For
each $\eta\in(0,1]$, one can find a subset $T[\eta]$ of $\mathcal{T}$ with cardinality bounded by $(5/\eta)^{\mathcal{D}(T)}$
such that

$$\inf_{t\in T[\eta]}d(t,v) \le \inf_{t\in T}d(t,v)+\big[\eta\wedge\mathcal{D}(T)\big] \quad\text{for all } v\in\mathcal{T}. \tag{6.2}$$

Proof: If $\mathcal{D}(T) = 0$, then $T = \{t\}$, we set $T[\eta] = \{(-1\vee t)\wedge1\}$ and the result is
immediate since $v$ takes its values in $[-1,1]$. Otherwise, let $T'$ be a maximal subset
of $T$ such that $d(t,t') > \eta$ for each pair $(t,t')$ of distinct points in $T'$. Then, for
each $t\in T$ there exists $t'\in T'$ such that $d(t,t')\le\eta$ and it follows from Lemma 4 in
Birgé (2006) that $|T'|\le(5/\eta)^{\mathcal{D}(T)}$. Now set $T[\eta] = \{(-1\vee t)\wedge1,\ t\in T'\}$. Then (6.2)
holds since $\mathcal{D}(T)\ge1$. □

We are now in a position to build discrete models for approximating the elements of
$\mathcal{T}^l$. Given $j\in\{1,\dots,l\}$, $T_j$ in $\mathbb{T}_j$ and some $i\in\mathbb{N}^*$, the previous lemma provides a
set $T_j[e^{-i}]$ satisfying $\big|T_j[e^{-i}]\big|\le\exp\big[\mathcal{D}(T_j)(i+\log5)\big]$. Moreover,

$$d\big(u_j,T_j[e^{-i}]\big) \le d(u_j,T_j)+\big[e^{-i}\wedge\mathcal{D}(T_j)\big] \quad\text{for all } u\in\mathcal{T}^l \text{ and } i\in\mathbb{N}^*. \tag{6.3}$$

We then define the family $\mathbb{T}$ of models by

$$\mathbb{T} = \Big\{\prod_{j=1}^{l}T_j[e^{-i_j}] \ \text{with } (i_j,T_j)\in\mathbb{N}^*\times\mathbb{T}_j \text{ for } j = 1,\dots,l\Big\}. \tag{6.4}$$

Then each $T = T_1[e^{-i_1}]\times\dots\times T_l[e^{-i_l}]$ in $\mathbb{T}$ has a finite cardinality bounded by

$$\log|T| \le \sum_{j=1}^{l}\mathcal{D}(T_j)(i_j+\log5). \tag{6.5}$$
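6.3 Proof of Theorem 2

Starting from the families $\mathbb{T}_j$, $1\le j\le l$, we build the set $\mathbb{T}$ given by (6.4) as
indicated in the previous section and we apply Theorem 9 to $\mathbb{F}$ and $\mathbb{T}$. This requires
to define a suitable subprobability $\lambda$ on $\mathbb{T}$, which can be done by setting, for each
$T = T_1[e^{-i_1}]\times\dots\times T_l[e^{-i_l}]$ in $\mathbb{T}$,

$$\lambda(T) = \prod_{j=1}^{l}\lambda_j(T_j)\exp\big[-i_j\mathcal{D}(T_j)\big] \quad\text{or}\quad \Delta_\lambda(T) = \sum_{j=1}^{l}\big[\Delta_{\lambda_j}(T_j)+i_j\mathcal{D}(T_j)\big].$$

Applying Theorem 9 to $\mathbb{F}$ and $\mathbb{T}$ with $I$ reduced to a single element and $\nu$ the Dirac
measure and using (6.5) and (6.3), which implies that

$$\inf_{t\in T}w_{g,j}\big(\|u_j-t_j\|_q\big) \le w_{g,j}\big(\big[e^{-i_j}\wedge\mathcal{D}(T_j)\big]+d(u_j,T_j)\big) \le w_{g,j}\big(e^{-i_j}\wedge\mathcal{D}(T_j)\big)+w_{g,j}\big(d(u_j,T_j)\big)$$

by the subadditivity property of the modulus of continuity $w_{g,j}$, we get the risk bound

$$C\,\mathbb{E}_s\big[d^2(s,\hat s)\big] \le \inf\sum_{j=1}^{l}\Big\{2l\big[w_{g,j}^2\big(d(u_j,T_j)\big)+w_{g,j}^2\big(e^{-i_j}\wedge\mathcal{D}(T_j)\big)\big]+\tau\big[\Delta_{\lambda_j}(T_j)+(2i_j+\log5)\mathcal{D}(T_j)\big]\Big\} + d^2(s,g\circ u)+\inf_{F\in\mathbb{F}}\big\{d_\infty^2(g,F)+\tau\big[\mathcal{D}(F)+\Delta_\gamma(F)\big]\big\},$$

where the first infimum runs among all $T_j\in\mathbb{T}_j$ and all $i_j\in\mathbb{N}^*$ for $j = 1,\dots,l$.
Setting $i_j = i(g,j,T_j)$ implies that $l\,w_{g,j}^2\big(e^{-i_j}\wedge\mathcal{D}(T_j)\big)\le\tau i_j\mathcal{D}(T_j)$, which proves
(4.2). As to (4.6), it simply derives from the fact that, if $\mathcal{D}(T)\ge1$, then

$$i(g,j,T) \le \Big\lceil(2\alpha_j)^{-1}\log\big(lL_j^2[\tau\mathcal{D}(T)]^{-1}\big)\Big\rceil \le \Big[\alpha_j^{-1}\log\big(lL_j^2[\tau\mathcal{D}(T)]^{-1}\big)\Big]\vee1 = \mathcal{L}_{j,T}.$$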
6.4 Proof of Theorem 3

It follows exactly the line of proof of Theorem 2 via Theorem 9 with an additional
step in order to mix the different families of models corresponding to the various sets
indexed by $\ell\in I$. To each $\ell$ corresponds a family of models $\mathbb{S}_\ell$ and a subprobability $\pi_\ell$ on $\mathbb{S}_\ell$ given
by (6.1). We again apply Theorem 9 with $I$ and $\nu$ as given in Theorem 3.

6.5 Proof of Lemma 1

If $D = 1$, we get the bound $a+b$. When $a > b$, we can choose $D$ such that
$(a/b)^{1/(\theta+1)}\le D < (a/b)^{1/(\theta+1)}+1$, so that

$$aD^{-\theta}+bD \le a(a/b)^{-\theta/(\theta+1)}+b\big[(a/b)^{1/(\theta+1)}+1\big] = b+2a^{1/(\theta+1)}b^{\theta/(\theta+1)}$$

and the bound $b+\big[2a^{1/(\theta+1)}b^{\theta/(\theta+1)}\wedge a\big]$ follows. If $b\ge a$, the bound $2b$ holds,
otherwise $b < a^{1/(\theta+1)}b^{\theta/(\theta+1)}$ and the conclusion follows.
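The chain of inequalities established in this proof can be checked numerically. The sketch below is an illustration only, with a hypothetical helper name and arbitrary test values. It minimizes $aD^{-\theta}+bD$ over a finite range of integers containing $D = 1$ and the value of $D$ chosen in the proof, so that `best` is an upper bound for the infimum over $\mathbb{N}^*$, and compares it with the two bounds.

```python
import math


def lemma1_quantities(a, b, theta):
    """Return (best, middle, upper) for the chain of inequalities in Lemma 1.

    best   : min of a * D^(-theta) + b * D over D = 1, ..., d_proof + 1
    middle : b + min(2 * a^(1/(theta+1)) * b^(theta/(theta+1)), a)
    upper  : max(3 * a^(1/(theta+1)) * b^(theta/(theta+1)), 2 * b)
    """
    geo = a ** (1.0 / (theta + 1.0)) * b ** (theta / (theta + 1.0))
    # Smallest integer D >= (a/b)^(1/(theta+1)), as chosen in the proof when a > b.
    d_proof = max(1, math.ceil((a / b) ** (1.0 / (theta + 1.0))))
    best = min(a * d ** (-theta) + b * d for d in range(1, d_proof + 2))
    middle = b + min(2.0 * geo, a)
    upper = max(3.0 * geo, 2.0 * b)
    return best, middle, upper


# Arbitrary test triples (a, b, theta), chosen for illustration.
cases = [(1.0, 1.0, 1.0), (50.0, 0.01, 0.5), (3.0, 7.0, 2.0), (1e4, 1e-3, 4.0), (0.2, 0.1, 0.1)]
results = [lemma1_quantities(a, b, theta) for a, b, theta in cases]
```

Restricting the minimization to a finite range only makes the check more demanding, since the true infimum is never larger than `best`.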



6.6 Proof of Proposition 3

It relies on the following lemma the proof of which is postponed to the end of the
section.

Lemma 4 Let $(E,\mathcal{E},\mu)$ be some probability space and $w$ some nondecreasing and
nonnegative concave function on $\mathbb{R}_+$ such that $w(0) = 0$. For all $p\in[1,+\infty]$ and
$h\in\mathbb{L}_p(\mu)$,

$$\|w(|h|)\|_p \le 2^{1/p}\,w\big(\|h\|_p\big),$$

with the convention $2^{1/\infty} = 1$.

We argue as follows. For all $y,y'\in[-1,1]^l$, $|g(y)-g(y')|\le\sum_{j=1}^{l}w_{g,j}\big(|y_j-y'_j|\big)$
and, since $\mu$ is a probability on $E$,

$$\begin{aligned}
\|g\circ u-f\circ t\|_p &\le \|g\circ u-g\circ t\|_p+\|g\circ t-f\circ t\|_p\\
&\le \Big\|\sum_{j=1}^{l}w_{g,j}\big(|u_j-t_j|\big)\Big\|_p+\|g\circ t-f\circ t\|_p\\
&\le \sum_{j=1}^{l}\big\|w_{g,j}\big(|u_j-t_j|\big)\big\|_p+\sup_{y\in[-1,1]^l}|g(y)-f(y)|\\
&\le 2^{1/p}\sum_{j=1}^{l}w_{g,j}\big(\|u_j-t_j\|_p\big)+d_\infty(g,f),
\end{aligned}$$

which proves the proposition. Let us now turn to the proof of the lemma.

Proof of Lemma 4: Since there is nothing to prove if $\|h\|_p = 0$, we shall assume that
$\|h\|_p > 0$. The assumptions on $w$ imply that, for all $0 < a < b$, $b^{-1}w(b)\le a^{-1}w(a)$
and $w(a)\le w(b)$. Consequently, for $p\in[1,+\infty[$ and $b > 0$,

$$\begin{aligned}
\int w^p(|h|)\,d\mu &= \int w^p(|h|)\mathbb{1}_{|h|\le b}\,d\mu+\int w^p(|h|)\mathbb{1}_{|h|>b}\,d\mu\\
&\le w^p(b)+\int\frac{w^p(|h|)}{|h|^p}|h|^p\mathbb{1}_{|h|>b}\,d\mu \le w^p(b)+\frac{w^p(b)}{b^p}\int|h|^p\,d\mu,
\end{aligned}$$

and the result follows by choosing $b = \|h\|_p$. The case $p = \infty$ can be deduced by
letting $p$ go to $+\infty$. □
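The inequality of Lemma 4 lends itself to a quick numerical check on a finite probability space. This is a minimal sketch, assuming a randomly generated discrete probability and three concave moduli chosen for illustration; the helper names are hypothetical.

```python
import math
import random


def lp_norm(values, weights, p):
    """L_p norm of a function taking `values` with probabilities `weights`."""
    if math.isinf(p):
        return max(abs(v) for v, w in zip(values, weights) if w > 0)
    return sum(w * abs(v) ** p for v, w in zip(values, weights)) ** (1.0 / p)


def lemma4_sides(w, values, weights, p):
    """Return (||w(|h|)||_p, 2^{1/p} * w(||h||_p)), with the convention 2^{1/inf} = 1."""
    lhs = lp_norm([w(abs(v)) for v in values], weights, p)
    factor = 1.0 if math.isinf(p) else 2.0 ** (1.0 / p)
    return lhs, factor * w(lp_norm(values, weights, p))


rng = random.Random(2)
raw = [rng.random() for _ in range(50)]
weights = [x / sum(raw) for x in raw]            # a probability on 50 atoms
values = [rng.uniform(-10.0, 10.0) for _ in range(50)]

# Nondecreasing, nonnegative, concave functions on R_+ vanishing at 0.
moduli = [lambda z: z ** 0.3, math.log1p, lambda z: min(z, 1.0)]
checks = [lemma4_sides(w, values, weights, p)
          for w in moduli for p in (1.0, 1.5, 2.0, 5.0, math.inf)]
```

For $p = 1$ Jensen's inequality already gives the bound without the factor 2; the factor $2^{1/p}$ is what the lemma pays to cover $p > 1$, where $w^p$ need not be concave.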

6.7 Proof of Proposition 4

It suffices to show that for all $i\in\{1,\dots,k\}$ and $x\in[-1,1]^k$, the map $g\circ u_x(t) =
g\circ u(x_1,\dots,x_{i-1},t,x_{i+1},\dots,x_k)$ from $[-1,1]$ into $\mathbb{R}$ belongs to $\mathcal{H}^{\phi(\alpha,\beta_i)}([-1,1])$. If at
least $\alpha$ or $\beta_i$ are not larger than 1, the result is clear. Otherwise both are larger than
1 and we can write $\beta_i = b_i+\beta_i'$ and $\alpha = a+\alpha'$ with $a,b_i\in\mathbb{N}^*$ and $\beta_i',\alpha'\in(0,1]$. Both
functions $g$ and $u_x$ are $b_i\wedge a$ times differentiable and the derivatives $g^{(\ell)}\circ u_x$ and $u_x^{(\ell)}$
for $\ell = 0,\dots,b_i\wedge a$ are Hölderian with smoothness $\rho = (\beta_i-b_i\wedge a)\wedge(\alpha-b_i\wedge a)\in(0,1]$.
Since the derivative of order $b_i\wedge a$ of $g\circ u_x$ is a polynomial with respect to these
functions, we derive (5.3) from the fact that the set $(\mathcal{H}^{\rho}([-1,1]),+,\cdot)$ is an algebra
on $\mathbb{R}$.

We shall prove the second part of the proposition for the case $k = 1$ only since the
general case can be proved by similar arguments. For $\rho > 0$, let $h_\rho\in\mathcal{H}^{\rho}([-1,1])\setminus
\bigcup_{\rho'>\rho}\mathcal{H}^{\rho'}([-1,1])$. Given $\alpha,\beta > 0$, we distinguish between the cases below and the
reader can check that for each of these $g\in\mathcal{H}^{\alpha}([-1,1])$, $u\in\mathcal{H}^{\beta}([-1,1])$, $g\circ u\in
\mathcal{H}^{\theta}([-1,1])$ with $\theta = \phi(\alpha,\beta)$ but $g\circ u\notin\mathcal{H}^{\theta'}([-1,1])$ whatever $\theta' > \theta$. If $\alpha,\beta\le1$,
take $g(x) = |x|^{\alpha}$ and $u(y) = |y|^{\beta}$ for all $x,y\in[-1,1]$; if $1 < \beta$ and $\alpha\le\beta$, take $g = h_\alpha$
and $u(y) = y$ for all $y\in[-1,1]$; finally, if $\alpha > 1$ and $\alpha > \beta$, take $g(x) = x$ for all
$x\in[-1,1]$ and $u = h_\beta$.

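6.8 Proof of Lemma 2

For all $\alpha > 0$, the map defined for $y$ in $(0,+\infty)$ by

$$y\mapsto\frac{1}{\phi(\alpha,1/y)} = \begin{cases}y(\alpha\wedge1)^{-1} & \text{if } y\ge(\alpha\vee1)^{-1};\\ \alpha^{-1} & \text{otherwise,}\end{cases}$$

is positive, piecewise linear and convex. Hence,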

k 



1 

> 



and equality holds if and only if $\beta_i\le(\alpha\vee1)$ for all $i$ or if for all $i$, $\beta_i\ge(\alpha\vee1)$. We
conclude by using the fact that $\phi(\alpha,z)\le z(\alpha\wedge1)$ for all positive numbers $z$ and that
equality holds if and only if $z\le\alpha\vee1$.
Abstract

We consider the problem of estimating a function $s$ on $[-1,1]^k$ for large
values of $k$ by looking for some best approximation of $s$ by composite functions
of the form $g\circ u$. Our solution is based on model selection and leads to a very
general approach to solve this problem with respect to many different types of
functions $g$, $u$ and statistical frameworks. In particular, we handle the problems of
approximating $s$ by additive functions, single and multiple index models, neural
networks, mixtures of Gaussian densities (when $s$ is a density) among other
examples. We also investigate the situation where $s = g\circ u$ for functions $g$ and $u$
belonging to possibly anisotropic smoothness classes. In this case, our approach
leads to a completely adaptive estimator with respect to the regularity of $s$.

1 Introduction 

In various statistical problems, we have at hand a random mapping $X$ from a
measurable space $(\Omega,\mathcal{A})$ to $(\mathbb{X},\mathcal{X})$ with an unknown distribution $P_s$ on $\mathbb{X}$ depending on
some parameter $s\in\mathcal{S}$ which is a function from $[-1,1]^k$ to $\mathbb{R}$. For instance, $s$ may
be the density of an i.i.d. sample or the intensity of a Poisson process on $[-1,1]^k$

^ AMS 1991 subject classifications. Primary 62G05 

Key words and phrases. Curve estimation, model selection, composite functions. 
or a regression function. The statistical problem amounts to estimating $s$ by some
estimator $\hat s = \hat s(X)$ the performance of which is measured by its quadratic risk,

$$R(s,\hat s) = \mathbb{E}_s\big[d^2(s,\hat s)\big],$$

where $d$ denotes a given distance on $\mathcal{S}$. To be more specific, we shall assume in
this introduction that $X = (X_1,\dots,X_n)$ is a sample of density $s^2$ (with $s\ge0$)
with respect to some measure $\mu$ and $d$ is the Hellinger distance. We recall that,
given two probabilities $P,Q$ dominated by $\mu$ with respective densities $f = dP/d\mu$ and
$g = dQ/d\mu$, the Hellinger distance $h$ between $P$ and $Q$ or, equivalently, between $f$
and $g$ (since it is independent of the choice of $\mu$) is given by

$$h^2(P,Q) = h^2(f,g) = \frac{1}{2}\int\big(\sqrt{f}-\sqrt{g}\big)^2d\mu. \tag{1.1}$$

It follows that $\sqrt{2}\,d(s,t)$ is merely the $\mathbb{L}_2$-distance between $s$ and $t$.

A general method for constructing estimators $\hat s$ is to choose a model $S$ for $s$, i.e.
do as if $s$ belonged to $S$, and to build $\hat s$ as an element of $S$. Sometimes the statistician
really assumes that $s$ belongs to $S$ and that $S$ is the true parameter set, sometimes
he does not and rather considers $S$ as an approximate model. This latter approach
is somewhat more reasonable since it is in general impossible to be sure that $s$ does
belong to $S$. Given $S$ and a suitable estimator $\hat s$, as those built in Birgé (2006) for
example, one can achieve a risk bound of the form

$$R(s,\hat s) \le C\Big[\inf_{t\in S}d^2(s,t)+\tau\mathcal{D}(S)\Big], \tag{1.2}$$

where $C$ is a universal constant (independent of $s$ and $S$), $\mathcal{D}(S)$ the dimension of the
model $S$ (with a proper definition of the dimension) and $\tau$, which is equal to $1/n$ in
the specific context of density estimation, characterizes the amount of information
provided by the observation $X$.

It is well-known that many classical estimation procedures suffer from the so-called
"curse of dimensionality", which means that the risk bound (1.2) deteriorates when
$k$ increases and actually becomes very loose for even moderate values of $k$. This
phenomenon is easy to explain and actually connected with the most classical way
of choosing models for $s$. Typically, and although there is no way to check that such
an assumption is true, one assumes that $s$ belongs to some smoothness class (Hölder,
Sobolev or Besov) of index $\alpha$ and such an assumption can be translated in terms of
approximation properties with respect to the target function $s$ of a suitable collection
of linear spaces (generated by piecewise polynomials, splines, or wavelets for example).
More precisely, there exists a collection $\mathbb{S}$ of models with the following property: for
all $D\ge1$, there exists a model $S\in\mathbb{S}$ with dimension $D$ which approximates $s$ with
an error bounded by $cD^{-\alpha/k}$ for some $c$ independent of $D$ (but depending on $s$, $\alpha$
and $k$). With such a collection at hand, we deduce from (1.2) that whatever $D\ge1$
one can choose a model $S = S(D)\in\mathbb{S}$ for which the estimator $\hat s\in S$ achieves a
risk bounded from above by $C\big[c^2D^{-2\alpha/k}+\tau D\big]$. Besides, by using the elementary
Lemma 1 below to be proved in Section 5.6, one can optimize the choice of $D$, and
hence of the model in $\mathbb{S}$, to build an estimator whose risk satisfies

$$R(s,\hat s) \le C\max\big\{3c^{2k/(2\alpha+k)}\tau^{2\alpha/(2\alpha+k)},\,2\tau\big\}. \tag{1.3}$$

Lemma 1 For all positive numbers $a$, $b$ and $\theta$ and $\mathbb{N}^*$ the set of positive integers,

$$\inf_{D\in\mathbb{N}^*}\big\{aD^{-\theta}+bD\big\} \le b+\min\big\{2a^{1/(\theta+1)}b^{\theta/(\theta+1)},\,a\big\} \le \max\big\{3a^{1/(\theta+1)}b^{\theta/(\theta+1)},\,2b\big\}.$$

Since the risk bound (1.3) is achieved for $D$ of order $\tau^{-k/(2\alpha+k)}$ as $\tau$ tends to 0, the
deterioration of the rate $\tau^{2\alpha/(2\alpha+k)}$ when $k$ increases comes from the fact that we
use models of larger dimension to approximate $s$ when $k$ is large. Nevertheless, this
phenomenon is only due to the previous approach based on smoothness assumptions
for $s$. An alternative approach, assuming that $s$ can be closely approximated by
suitable parametric models the dimensions of which do not depend on $k$ would not suffer
from the same weaknesses. More generally, a structural assumption on $s$ associated
to a collection of models the approximation properties of which improve on those
of $\mathbb{S}$, can only lead to a better risk bound and it is not clear at all that assuming
that $s$ belongs to a smoothness class is more realistic than directly assuming
approximation bounds with respect to the models of this collection. Such structural assumptions that
would amount to replacing the large models involved in the approximation of smooth
functions by simpler ones have been used for many years, especially in the context of
regression. Examples of such structural assumptions are provided by additive models,
the single index model, the projection pursuit algorithm introduced by Friedman and
Tukey (1974) (an overview of the procedure is available in Huber (1985)) and
artificial neural networks as in Barron (1993; 1994), among other examples. It actually
appears that a large number of these alternative approaches (in particular those we
just cited) can be viewed as examples of approximation by composite functions.
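The deterioration of the rate with the dimension can be made concrete with a few numbers. This is only an illustrative sketch, assuming $\alpha = 1$ and $\tau = 10^{-6}$ (for instance $\tau = 1/n$ with $n = 10^6$ observations); the function names are hypothetical.

```python
def rate_exponent(alpha, k):
    """Exponent 2*alpha / (2*alpha + k) of tau in the smoothness-based risk bound."""
    return 2.0 * alpha / (2.0 * alpha + k)


def optimal_dimension_order(tau, alpha, k):
    """Order of magnitude tau^(-k / (2*alpha + k)) of the model dimension D."""
    return tau ** (-k / (2.0 * alpha + k))


alpha, tau = 1.0, 1e-6          # illustrative values (assumptions)
table = [(k, rate_exponent(alpha, k), tau ** rate_exponent(alpha, k),
          optimal_dimension_order(tau, alpha, k)) for k in (1, 2, 5, 10, 50)]
for k, expo, risk, dim in table:
    print(f"k={k:2d}  exponent={expo:.3f}  tau^exponent={risk:.2e}  D~{dim:.2e}")
```

As $k$ grows the exponent tends to 0 and the dimension of the model needed to balance approximation and estimation errors approaches $1/\tau$, which is the quantitative content of the "curse of dimensionality" discussed above.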

In any case, an unattractive feature of the previous approach based on an a priori
choice of a model $S\in\mathbb{S}$ is that it requires knowing suitable upper bounds on the
distances between $s$ and the models $S$ in $\mathbb{S}$. Such a requirement is much too strong
and an essential improvement can be brought by the modern theory of model selection.
More precisely, given some prior probability $\pi$ on $\mathbb{S}$, model selection allows one to build
an estimator $\hat s$ with a risk bound

$$C\,R(s,\hat s) \le \inf_{S\in\mathbb{S}}\Big\{\inf_{t\in S}d^2(s,t)+\tau\big[\mathcal{D}(S)+\log(1/\pi(S))\big]\Big\}, \tag{1.4}$$

for some universal constant $C > 0$. If we neglect the influence of $\log(1/\pi(S))$, which is
connected to the complexity of the family $\mathbb{S}$ of models we use, the comparison between
(1.2) and (1.4) indicates that the method selects a model in $\mathbb{S}$ leading approximately
to the smallest risk bound.

With such a tool at hand that allows us to play with many models simultaneously 
and let the estimator choose a suitable one, we may freely introduce various models 
corresponding to various sorts of structural assumptions on s that avoid the "curse 
of dimensionality". We can, moreover, mix them with models which are based on 
pure smoothness assumptions that do suffer from this dimensional effect or even with 
simple parametric models. This means that we can thus cumulate the advantages of
the various models we introduce in the family $\mathbb{S}$.

The main purpose of this paper is to provide a method for building various sorts
of models that may be used, in conjunction with other ones, to approximate functions
on $[-1,1]^k$ for large values of $k$. The idea, which is not new, is to approximate the
unknown $s$ by a composite function $g\circ u$ where $g$ and $u$ have different approximation
properties. If, for instance, the true $s$ can be closely approximated by a function $g\circ u$
where $u$ goes from $[-1,1]^k$ to $[-1,1]$ and is very smooth and $g$, from $[-1,1]$ to $\mathbb{R}$,
is rough, the overall smoothness of $g\circ u$ is that of $g$ but the curse of dimensionality
only applies to the smooth part $u$, resulting in a much better rate of estimation than
what would be obtained by only considering $g\circ u$ as a rough function from $[-1,1]^k$
to $\mathbb{R}$. This is an example of the substantial improvement that might be brought by
the use of models of composite functions.

Recent works in this direction can be found in Horowitz and Mammen (2007)
or Juditsky, Lepski and Tsybakov (2009). Actually, our initial motivation for this
research was a series of lectures given at CIRM in 2005 by Oleg Lepski about a
former version of this last paper. There are, nevertheless, major differences between
their approach and ours. They deal with estimation in the white noise model, kernel
methods and the $\mathbb{L}_\infty$-loss. They also assume that the true unknown density $s$ to be
estimated can be written as $s = g\circ u$ where $g$ and $u$ have given smoothness properties
and use these properties to build a kernel estimator which is better than those based
on the overall smoothness of $s$. The use of the $\mathbb{L}_\infty$-loss indeed involves additional
difficulties and the minimax rates of convergence happen to be substantially slower
(not only by logarithmic terms) than the rates one gets for the $\mathbb{L}_2$-loss, as the authors
mention on page 1369, comparing their results with those of Horowitz and Mammen
(2007).

Our approach is radically different from the one of Juditsky, Lepski and Tsybakov 
and considerably more general as we shall see, but this level of generality has a price. 
While they provide a constructive estimator that can be computed in a reasonable 
amount of time, although based on supposedly known smoothness properties of $g$ and
$u$, we offer a general but abstract method that applies to many situations but does
not provide practical estimators, only abstract ones. As a consequence, our results
about the performance of these estimators are of a theoretical nature, to serve as 
benchmarks about what can be expected from good estimators in various situations. 

We actually consider "curve estimation" with an unknown functional parameter
$s$ and measure the loss by $\mathbb{L}_2$-type distances. Our construction applies to various
statistical frameworks (not only the Gaussian white noise but also all those for which
a suitable model selection theorem is available). Besides, we do not assume that
$s = g\circ u$ but rather approximate $s$ by functions of the form $g\circ u$ and do not fix in
advance the smoothness properties of $g$ and $u$ but rather let our estimator adapt to
it. In order to give a simple account of our result, let us focus on pairs $(u,g)$ with
$u$ mapping $[-1,1]^k$ into $[-1,1]$ and $g$ mapping $[-1,1]$ into $\mathbb{R}$. In this case, our main theorem
says the following: consider two (at most) countable collections of models $\mathbb{T}$ and $\mathbb{F}$,
endowed with the probabilities $\lambda$ and $\gamma$ respectively, in order to approximate such
functions $u$ and $g$ respectively. There exists an estimator $\hat s$ such that, whatever the
choices of $u$ and $g$ with $g$ at least $L$-Lipschitz for some $L > 0$,

$$C'(L)\,R(s,\hat s) \le d^2(s,g\circ u)+\inf_{F\in\mathbb{F}}\Big\{\inf_{f\in F}d_\infty^2(g,f)+\tau\big[\mathcal{D}(F)+\log(1/\gamma(F))\big]\Big\}+\inf_{T\in\mathbb{T}}\Big\{\inf_{t\in T}d^2(u,t)+\tau\big[\mathcal{D}(T)\log\tau^{-1}+\log(1/\lambda(T))\big]\Big\}, \tag{1.5}$$
where $d_\infty$ denotes the distance based on the supremum norm. Compared to (1.4),
this result says that, apart from the extra logarithmic terms and the constant $C'$
depending on $L$, if $s$ were of the form $g\circ u$ the risk bound we get for estimating
$s$ is the maximum of those we would get for estimating $g$ and $u$ separately from a
model selection procedure based on $(\mathbb{F},\gamma)$ and $(\mathbb{T},\lambda)$ respectively. A more general
version of (1.5) allowing to handle less regular functions $g$ and multivariate functions
$u = (u_1,\dots,u_l)$ with values in $[-1,1]^l$ is available in Section 3. As a consequence,
our approach leads to a completely adaptive method with many different possibilities
to approximate $s$. It allows, in particular, to play with the smoothness properties of
$g$ and $u$ or to mix purely parametric models with others based on smooth functions.
Since methods and theorems about model selection are already available, our main
task here will be to build suitable models for various forms of composite functions
$g\circ u$ and check that they do satisfy the assumptions required for applying previous
model selection results.

2 Our statistical framework 

We observe a random element $X$ from the probability space $(\Omega,\mathcal{A},\mathbb{P}_s)$ to $(\mathbb{X},\mathcal{X})$ with
distribution $P_s$ on $\mathbb{X}$ depending on an unknown parameter $s$. The set $\mathcal{S}$ of possible
values of $s$ is a subset of some space $\mathbb{L}_q(E,\mu)$ where $\mu$ is a given probability on the
measurable space $(E,\mathcal{E})$. We shall mainly consider the case $q = 2$ even though one
can also take $q = 1$ in the context of density estimation. We denote by $d$ the distance
on $\mathbb{L}_q(E,\mu)$ corresponding to the $\mathbb{L}_q(E,\mu)$-norm $\|\cdot\|_q$ (omitting the dependency of $d$
with respect to $q$) and by $\mathbb{E}_s$ the expectation with respect to $\mathbb{P}_s$ so that the quadratic
risk of an estimator $\widehat{s}$ is $\mathbb{E}_s[d^2(s,\widehat{s})]$. The main objective of this paper, in order to
estimate $s$ by model selection, is to build special models $S$ that consist of functions of
the form $f\circ t$ where $t = (t_1,\ldots,t_l)$ is a mapping from $E$ to $I\subset\mathbb{R}^l$, $f$ is a continuous
function on $I$ and $I = \prod_{j=1}^{l} I_j$ is a product of compact intervals of $\mathbb{R}$. Without loss
of generality, we may assume that $I = [-1,1]^l$. Indeed, if $l = 1$, $t$ takes its values in
$I_1 = [\beta-\alpha, \beta+\alpha]$, $\alpha > 0$, and $f$ is defined on $I_1$, we can replace the pair $(f,t)$ by
$(\overline{f},\overline{t})$ where $\overline{t}(x) = \alpha^{-1}[t(x)-\beta]$ and $\overline{f}(y) = f(\alpha y+\beta)$ so that $\overline{t}$ takes its values in
$[-1,1]$ and $f\circ t = \overline{f}\circ\overline{t}$. The argument easily extends to the multidimensional case.
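
The rescaling argument can be checked numerically. The following minimal sketch covers the case $l = 1$; the particular functions `t` and `f` and the interval $[2,6]$ are hypothetical choices made only for illustration.

```python
import math

def rescale(f, t, beta, alpha):
    """Return (f_bar, t_bar) with t_bar valued in [-1, 1] and f_bar(t_bar(x)) == f(t(x))."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")

    def t_bar(x):
        # t_bar(x) = alpha^{-1} [t(x) - beta]
        return (t(x) - beta) / alpha

    def f_bar(y):
        # f_bar(y) = f(alpha * y + beta)
        return f(alpha * y + beta)

    return f_bar, t_bar

# Hypothetical example: t maps [0, 1] into I_1 = [2, 6], i.e. beta = 4 and alpha = 2.
t = lambda x: 2.0 + 4.0 * x
f = math.sqrt
f_bar, t_bar = rescale(f, t, beta=4.0, alpha=2.0)

xs = [i / 10 for i in range(11)]
max_gap = max(abs(f(t(x)) - f_bar(t_bar(x))) for x in xs)
in_range = all(-1.0 <= t_bar(x) <= 1.0 for x in xs)
```

Here `in_range` confirms that the rescaled mapping takes its values in $[-1,1]$ and `max_gap` measures the discrepancy between the two compositions on the grid.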

2.1 Notations and conventions 

To perform our construction based on composite functions $f\circ t$, we introduce the
following spaces of functions: $\mathcal{T}\subset\mathbb{L}_q(E,\mu)$ is the set of measurable mappings from
$E$ to $[-1,1]$, $\mathcal{F}_{l,\infty}$ is the set of bounded functions on $[-1,1]^l$ endowed with the distance
$d_\infty$ given by $d_\infty(f,g) = \sup_{x\in[-1,1]^l}|f(x)-g(x)|$ and $\mathcal{F}_{l,c}$ is the subset of $\mathcal{F}_{l,\infty}$ which
consists of continuous functions on $[-1,1]^l$. We denote by $\mathbb{N}^*$ (respectively, $\mathbb{R}_+^*$) the
set of positive integers (respectively positive numbers) and set

$$\lfloor z\rfloor = \sup\{j\in\mathbb{Z} \mid j\le z\} \quad\text{and}\quad \lceil z\rceil = \inf\{j\in\mathbb{N}^* \mid j\ge z\}, \quad\text{for all } z\in\mathbb{R}.$$

The numbers $x\wedge y$ and $x\vee y$ stand for $\min\{x,y\}$ and $\max\{x,y\}$ respectively and
$\log_+(x)$ stands for $(\log x)\vee 0$. The cardinality of a set $A$ is denoted by $|A|$ and, by
convention, "countable" means "finite or countable". We call subprobability on some
countable set $A$ any positive measure $\pi$ on $A$ with $\pi(A)\le 1$ and, given $\pi$ and $a\in A$,
we set $\pi(a) = \pi(\{a\})$ and $\Delta_\pi(a) = -\log(\pi(a))$ with the convention $\Delta_\pi(a) = +\infty$
if $\pi(a) = 0$. The dimension of the linear space $V$ is denoted by $\mathcal{D}(V)$. Given a
compact subset $K$ of $\mathbb{R}^k$ with $\mathring{K}\ne\emptyset$, we define the Lebesgue probability $\mu$ on $K$ by
$\mu(A) = \lambda(A)/\lambda(K)$ for $A\subset K$, where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^k$.

For $x\in\mathbb{R}^m$, $x_j$ denotes the $j$th coordinate of $x$ ($1\le j\le m$) and, similarly,
$x_{i,j}$ denotes the $j$th coordinate of $x_i$ if the vectors $x_i$ are already indexed. We set
$|x|^2 = \sum_{j=1}^{m} x_j^2$ for the squared Euclidean norm of $x\in\mathbb{R}^m$, without reference to the
dimension $m$, and denote by $\mathcal{B}_m$ the corresponding closed unit ball in $\mathbb{R}^m$. Similarly,
$|x|_\infty = \max\{|x_1|,\ldots,|x_m|\}$ for all $x\in\mathbb{R}^m$. For $x$ in some metric space $(M,d)$
and $r > 0$, $\mathcal{B}(x,r)$ denotes the closed ball of center $x$ and radius $r$ in $M$ and for
$A\subset M$, $d(x,A) = \inf_{y\in A} d(x,y)$. Finally, $C$ stands for a universal constant while
$C'$ is a constant that depends on some parameters of the problem. We may make
this dependence explicit by writing $C'(a,b)$ for instance. Both $C$ and $C'$ are generic
notations for constants that may change from line to line.
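
Two of these conventions differ slightly from what standard libraries provide: the upper integer part is an infimum over positive integers, so it is never smaller than 1, and the weight $-\log\pi(a)$ is infinite when $\pi(a) = 0$. A small sketch of these conventions, with function names that are ours and not the paper's:

```python
import math

def floor_z(z):
    """sup{j in Z : j <= z}."""
    return math.floor(z)

def ceil_pos(z):
    """inf{j in N* : j >= z}; at least 1 because j ranges over positive integers."""
    return max(1, math.ceil(z))

def log_plus(x):
    """(log x) v 0 for x > 0."""
    return max(math.log(x), 0.0)

def delta(pi_a):
    """-log(pi(a)), with the convention +infinity when pi(a) = 0."""
    return math.inf if pi_a == 0 else -math.log(pi_a)
```

For instance `ceil_pos(-3.2)` evaluates to 1 whereas `math.ceil(-3.2)` evaluates to -3.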

2.2 A general model selection result 

General model selection results apply to models which possess a finite dimension in
a suitable sense. Throughout the paper, we assume that, in the statistical framework
we consider, the following theorem holds.

Theorem 1 Let $\mathbb{S}$ be a countable family of finite dimensional linear subspaces $S$ of
$\mathbb{L}_q(E,\mu)$ and let $\pi$ be some subprobability measure on $\mathbb{S}$. There exists an estimator
$\widehat{s} = \widehat{s}(X)$ with values in $\bigcup_{S\in\mathbb{S}} S$ satisfying, for all $s\in\mathcal{S}$,

$$\mathbb{E}_s\big[d^2(s,\widehat{s})\big] \le C\inf_{S\in\mathbb{S}}\big\{d^2(s,S) + \tau\,[(\mathcal{D}(S)\vee 1) + \Delta_\pi(S)]\big\}, \qquad (2.1)$$

where the positive constant $C$ and parameter $\tau$ only depend on the specific statistical
framework at hand.

Similar results often hold also for the loss function $d^r(s,\widehat{s})$ ($r\ge 1$) replacing $d^2(s,\widehat{s})$.
In such a case, the results we prove below for the quadratic risk easily extend to the
risk $\mathbb{E}_s[d^r(s,\widehat{s})]$. For simplicity, we shall only focus on the case $r = 2$.
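
To make the trade-off in the bound of Theorem 1 concrete, the sketch below evaluates the quantity inside the infimum for a finite list of models and returns the minimizer. It only computes the right-hand side (without the constant $C$); it is not the estimator itself. The approximation errors $D^{-2}$ and the weights $e^{-D}$ are hypothetical inputs chosen for illustration.

```python
import math

def penalized_bound(models, tau):
    """Minimize d^2 + tau * (max(D, 1) + Delta_pi) over a dict name -> (d^2, D, pi)."""
    best = None
    for name, (dist2, dim, weight) in models.items():
        delta = math.inf if weight == 0 else -math.log(weight)
        value = dist2 + tau * (max(dim, 1) + delta)
        if best is None or value < best[1]:
            best = (name, value)
    return best

# Hypothetical family: one model per dimension D, squared bias D^{-2}, weight e^{-D}.
# The weights sum to less than 1, so they define a subprobability.
models = {D: (D ** -2.0, D, math.exp(-D)) for D in range(1, 21)}
best_dim, best_value = penalized_bound(models, tau=0.01)
```

With these inputs the criterion equals $D^{-2} + 0.02\,D$, so a smaller $\tau$ (more information) shifts the minimizer toward larger models.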

2.3 Some illustrations 

The previous theorem actually holds for various statistical frameworks. Let us provide 
a partial list. 

Gaussian frameworks. A prototype for Gaussian frameworks is provided by some
Gaussian isonormal linear process as described in Section 2 of Birgé and Massart (2001).
In such a case, $X$ is a Gaussian linear process with a known variance $\tau$, indexed by
a subset $\mathcal{S}$ of some Hilbert space $\mathbb{L}_2(E,\mu)$. This means that $s\in\mathcal{S}$ determines the
distribution $P_s$. Regression with Gaussian errors and Gaussian sequences can both
be seen as particular cases of this framework. Then Theorem 1 is a consequence of
Theorem 2 of Birgé and Massart (2001). In the regression setting, Baraud, Giraud
and Huet (2009) considered the practical case of an unknown variance and proved
that (2.1) holds under the assumption that $\mathcal{D}(S)\vee\Delta_\pi(S)\le n/2$ for all $S\in\mathbb{S}$.

Density estimation. Here $X = (X_1,\ldots,X_n)$ is an $n$-sample with density $s^2$ with
respect to $\mu$ and $\mathcal{S}$ is the set of nonnegative elements of norm 1 in $\mathbb{L}_2(E,\mu)$. Then
$d(s,t) = \sqrt{2}\,h(s^2,t^2)$ where $h$ denotes the Hellinger distance between densities
defined by (1.1), $\tau = n^{-1}$ and Theorem 1 follows from Theorem 6 of Birgé (2006) or
Corollary 8 of Baraud (2011). Alternatively, one can take for $s$ the density itself, for
$\mathcal{S}$ the set of nonnegative elements of norm 1 in $\mathbb{L}_1(E,\mu)$ and set $q = 1$. The result
then follows from Theorem 8 of Birgé (2006). Under the additional assumption that
$s\in\mathbb{L}_2(E,\mu)\cap\mathbb{L}_\infty(E,\mu)$, the case $q = 2$ follows from Theorem 6 of Birgé (2008) with

$$\tau = n^{-1}\|s\|_\infty\big(1\vee\log\|s\|_\infty\big).$$

Regression with fixed design. We observe $X = \{(x_1,Y_1),\ldots,(x_n,Y_n)\}$ with
$\mathbb{E}[Y_i] = s(x_i)$ where $s$ is a function from $E = \{x_1,\ldots,x_n\}$ to $\mathbb{R}$ and the errors
$\varepsilon_i = Y_i - s(x_i)$ are i.i.d. Here $\mu$ is the uniform distribution on $E$, hence
$d^2(s,t) = n^{-1}\sum_{i=1}^{n}[s(x_i)-t(x_i)]^2$ and $\tau = 1/n$. When the errors $\varepsilon_i$ are subgaussian, Theorem 1
follows from Theorem 3.1 in Baraud, Comte and Viennet (2001). For more heavy-tailed
distributions (Laplace, Cauchy, etc.) we refer to Theorem 6 of Baraud (2011)
when $s$ takes its values in $[-1,1]$.

Bounded regression with random design. Let $(X,Y)$ be a pair of random variables
with values in $E\times[-1,1]$ where $X$ has distribution $\mu$ and $\mathbb{E}[Y\mid X = x] = s(x)$
is a function from $E$ to $[-1,1]$. Our aim here is to estimate $s$ from the observation
of $n$ independent copies $\boldsymbol{X} = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ of $(X,Y)$. Here the distance
$d$ corresponds to the $\mathbb{L}_2(E,\mu)$-distance and Theorem 1 follows from Corollary 8 in
Birgé (2006) with $\tau = n^{-1}$.

Poisson processes. In this case, $X$ is a Poisson process on $E$ with mean measure
$s^2\cdot\mu$, where $s$ is a nonnegative element of $\mathbb{L}_2(E,\mu)$. Then $\tau = 1$ and Theorem 1
follows from Birgé (2007) or Corollary 8 of Baraud (2011).

3 The basic theorems 

3.1 Models and their dimensions 

If we assume that the unknown parameter $s$ to be estimated is equal or close to some
composite function of the form $g\circ u$ with $u\in\mathcal{T}^l$ and $g\in\mathcal{F}_{l,c}$ and if we wish to estimate
$g\circ u$ by model selection, we need to have at our disposal a family $\mathbb{F}$ of models for
approximating $g$ and families $\mathbb{T}_j$, $1\le j\le l$, to approximate the components $u_j$ of $u$. Typical
sets that are used for approximating elements of $\mathcal{F}_{l,c}$ or $\mathcal{T}$ are finite-dimensional
linear spaces or subsets of them. Many examples of such spaces are described in
books on Approximation Theory, like the one by DeVore and Lorentz (1993), and we
need a theorem which applies to such classical approximation sets for which it will
be convenient to choose the following definition of their dimension.

Definition 1 Let $H$ be a linear space and $S\subset H$. The dimension $\mathcal{D}(S)\in\mathbb{N}\cup\{\infty\}$
of $S$ is 0 if $|S| = 1$ and is, otherwise, the dimension (in the usual sense) of the linear
span of $S$.
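
As an illustration of this definition, the sketch below computes the dimension of a finite set of vectors, which stand in for functions expanded on a fixed basis (an assumption made for the example only). A singleton has dimension 0 even when its element is nonzero; any other set has the rank of its elements as dimension. Exact rational arithmetic avoids numerical rank issues.

```python
from fractions import Fraction

def rank(vectors):
    """Rank of a list of equal-length vectors, by Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def dimension(S):
    """Dimension of a finite, nonempty set S of vectors in the sense of Definition 1."""
    points = {tuple(v) for v in S}
    if len(points) == 1:
        return 0
    return rank([list(p) for p in points])
```

For example `dimension([(1, 2, 3)])` is 0, while `dimension([(1, 0, 0), (0, 1, 0), (1, 1, 0)])` is 2.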
3.2 Some smoothness assumptions

In order to transfer the approximation properties of $g$ by $f$ and $u$ by $t$ into
approximation of $g\circ u$ by $f\circ t$, we shall also require that $g$ be somewhat smooth. The
smoothness assumptions we need can be expressed in terms of moduli of continuity.
We start with the definition of the modulus of continuity of a function $g$ in $\mathcal{F}_{l,c}$.

Definition 2 We say that $w$ from $[0,2]^l$ to $\mathbb{R}_+^l$ is a modulus of continuity for a
continuous function $g$ on $[-1,1]^l$ if, for all $z\in[0,2]^l$, $w(z)$ is of the form
$w(z) = (w_1(z_1),\ldots,w_l(z_l))$ where each function $w_j$ with $j = 1,\ldots,l$ is continuous,
nondecreasing and concave from $[0,2]$ to $\mathbb{R}_+$, satisfies $w_j(0) = 0$, and

$$|g(x)-g(y)| \le \sum_{j=1}^{l} w_j(|x_j-y_j|) \quad\text{for all } x,y\in[-1,1]^l.$$

For $\alpha\in(0,1]^l$ and $L\in(0,+\infty)^l$, we say that $g$ is $(\alpha,L)$-Hölderian if one can take
$w_j(z) = L_j z^{\alpha_j}$ for all $z\in[0,2]$ and $j = 1,\ldots,l$. It is said to be $L$-Lipschitz if it is
$(\alpha,L)$-Hölderian with $\alpha = (1,\ldots,1)$.

Note that our definition of a modulus of continuity implies that the $w_j$ are subadditive,
a property which we shall often use in the sequel and that, given $g$, one can always
choose for $w_j$ the least concave majorant of $\overline{w}_j$ where

$$\overline{w}_j(z) = \sup_{x\in[-1,1]^l;\ x_j\le 1-z}\big|g(x) - g(x_1,\ldots,x_{j-1},x_j+z,x_{j+1},\ldots,x_l)\big|.$$

Then $w_j(z)\le 2\overline{w}_j(z)$ according to Lemma 6.1 p. 43 of DeVore and Lorentz (1993).
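
For $l = 1$ the supremum defining the directional modulus can be approximated on a grid. The sketch below does this for the hypothetical function $g(x) = \sqrt{|x|}$, which is Hölderian with exponent $1/2$ and constant 1, and checks the Hölder bound at a range of increments. A grid maximum only bounds the true supremum from below, so this is a sanity check rather than a proof.

```python
def modulus_bar(g, z, n=2000):
    """Grid approximation of sup over x in [-1, 1 - z] of |g(x) - g(x + z)| (case l = 1)."""
    width = 2.0 - z
    best = 0.0
    for i in range(n + 1):
        x = -1.0 + width * i / n
        best = max(best, abs(g(x) - g(x + z)))
    return best

# Hypothetical example: g(x) = sqrt(|x|), for which |g(x) - g(y)| <= |x - y|^(1/2).
g = lambda x: abs(x) ** 0.5
zs = [0.01 * i for i in range(1, 201)]
holder_ok = all(modulus_bar(g, z) <= z ** 0.5 + 1e-12 for z in zs)
```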

3.3 The main theorem 

Our construction of estimators $\widehat{s}$ of $g\circ u$ will be based on some set $\mathfrak{S}$ of the following
form:

$$\mathfrak{S} = (l,\mathbb{F},\gamma,\mathbb{T}_1,\ldots,\mathbb{T}_l,\lambda_1,\ldots,\lambda_l), \quad l\in\mathbb{N}^*, \qquad (3.1)$$

where $\mathbb{F},\mathbb{T}_1,\ldots,\mathbb{T}_l$ denote families of models and $\gamma$, $\lambda_j$ are measures on $\mathbb{F}$ and $\mathbb{T}_j$
respectively. In the sequel, we shall assume that $\mathfrak{S}$ satisfies the following requirements.

Assumption 1 The set $\mathfrak{S}$ is such that

i) the family $\mathbb{F}$ is a countable set and consists of finite-dimensional linear subspaces
$F$ of $\mathcal{F}_{l,\infty}$ with respective dimensions $\mathcal{D}(F)\ge 1$,

ii) for $j = 1,\ldots,l$, $\mathbb{T}_j$ is a countable set of subsets of $\mathbb{L}_q(E,\mu)$ with finite
dimensions,

iii) the measure $\gamma$ is a subprobability on $\mathbb{F}$,

iv) for $j = 1,\ldots,l$, $\lambda_j$ is a subprobability on $\mathbb{T}_j$.

Given $\mathfrak{S}$, one can design an estimator $\widehat{s}$ with the following properties.

Theorem 2 Assume that Theorem 1 holds and that $\mathfrak{S}$ satisfies Assumption 1. One
can build an estimator $\widehat{s} = \widehat{s}(X)$ satisfying, for all $u\in\mathcal{T}^l$ and $g\in\mathcal{F}_{l,c}$ with modulus
of continuity $w_g$,

$$C\,\mathbb{E}_s\big[d^2(s,\widehat{s})\big] \le \sum_{j=1}^{l}\inf_{T\in\mathbb{T}_j}\Big\{l\,w_{g,j}^2\big(d(u_j,T)\big) + \tau\big[\Delta_{\lambda_j}(T) + i(g,j,T)\,\mathcal{D}(T)\big]\Big\}$$

$$\qquad\qquad + d^2(s,g\circ u) + \inf_{F\in\mathbb{F}}\Big\{d_\infty^2(g,F) + \tau\big[\Delta_\gamma(F) + \mathcal{D}(F)\big]\Big\}, \qquad (3.2)$$

where $i(g,j,T) = 1$ if $\mathcal{D}(T) = 0$ and otherwise,

$$i(g,j,T) = \inf\big\{i\in\mathbb{N}^* \mid l\,w_{g,j}^2(e^{-i})\le\tau\,\mathcal{D}(T)\big\} < +\infty. \qquad (3.3)$$

Note that, since the risk bound (3.2) is valid for all $g\in\mathcal{F}_{l,c}$ and $u\in\mathcal{T}^l$, we can
minimize the right-hand side of (3.2) with respect to $g$ and $u$ in order to optimize the
bound. The proof of this theorem is postponed to Section 5.4.
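
The integer $i(g,j,T)$ of (3.3) is easy to compute once a modulus of continuity is given. The sketch below searches for it directly and, for a modulus of the form $w(z) = Lz^{\alpha}$, compares the result with the value obtained by solving $l L^2 e^{-2\alpha i}\le\tau\mathcal{D}(T)$ for $i$. The function names and numerical inputs are illustrative assumptions.

```python
import math

def index_i(w, l, tau, dim, max_iter=10_000):
    """Smallest i in N* with l * w(exp(-i))**2 <= tau * dim, and 1 when dim == 0."""
    if dim == 0:
        return 1
    for i in range(1, max_iter + 1):
        if l * w(math.exp(-i)) ** 2 <= tau * dim:
            return i
    raise RuntimeError("no admissible index found within max_iter")

def holder_index(alpha, L, l, tau, dim):
    """Closed form for w(z) = L * z**alpha, obtained by solving the inequality for i."""
    if dim == 0:
        return 1
    return max(1, math.ceil(math.log(l * L ** 2 / (tau * dim)) / (2.0 * alpha)))

# Lipschitz example (alpha = 1): L = 2, l = 1, tau = 1e-4, D(T) = 3.
i_lip = index_i(lambda z: 2.0 * z, l=1, tau=1e-4, dim=3)
# Hölder example: alpha = 1/2, L = 1, l = 2, tau = 1e-2, D(T) = 1.
i_hol = index_i(lambda z: z ** 0.5, l=2, tau=1e-2, dim=1)
```

In both examples the index grows like a multiple of $\log\tau^{-1}$ as $\tau$ decreases, which is the source of the logarithmic factors in the risk bounds.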

Of special interest is the case where $g$ is $L$-Lipschitz. If one is mainly interested in
the dependence of the risk bound with respect to $\tau$ as it tends to 0, one can check that
$i(g,j,T)\le\log\tau^{-1}$ for $\tau$ small enough (depending on $l$ and $L$) so that (3.2) becomes
for such a small $\tau$

$$C'\,\mathbb{E}_s\big[d^2(s,\widehat{s})\big] \le \sum_{j=1}^{l}\inf_{T\in\mathbb{T}_j}\Big\{d^2(u_j,T) + \tau\big(\Delta_{\lambda_j}(T) + \mathcal{D}(T)\log\tau^{-1}\big)\Big\}$$

$$\qquad\qquad + d^2(s,g\circ u) + \inf_{F\in\mathbb{F}}\Big\{d_\infty^2(g,F) + \tau\big[\mathcal{D}(F) + \Delta_\gamma(F)\big]\Big\}.$$

If it were possible to apply Theorem 1 to the models $F$ with the distance $d_\infty$ and
the models $T$ with the distance $d$ for each $j$ separately, we would get risk bounds of
this form, apart from the value of $C'$ and the extra $\log\tau^{-1}$ factor. This means that,
apart from this extra logarithmic factor, our procedure amounts to performing $l+1$
separate model selection procedures, one with the collection $\mathbb{F}$ for estimating $g$ and
the other ones with the collections $\mathbb{T}_j$ for the components $u_j$, finally getting the sum
of the $l+1$ resulting risk bounds. The result is however slightly different when $g$ is no
longer Lipschitz. When $g$ is $(\alpha,L)$-Hölderian then one can check that $i(g,j,T)\le\mathcal{L}_{j,T}$
where $\mathcal{L}_{j,T} = 1$ if $\mathcal{D}(T) = 0$ and, if $\mathcal{D}(T)\ge 1$,

$$\mathcal{L}_{j,T} = \left\lceil\frac{1}{2\alpha_j}\log\Big(lL_j^2\,[\tau\mathcal{D}(T)]^{-1}\Big)\right\rceil \qquad (3.4)$$

$$\qquad \le C'(l,\alpha_j)\big[\log(\tau^{-1})\vee\log\big(L_j^2/\mathcal{D}(T)\big)\vee 1\big]. \qquad (3.5)$$

In this case, Theorem 2 leads to the following result.

Corollary 1 Assume that the assumptions of Theorem 2 hold. For all $(\alpha,L)$-Hölderian
functions $g$ with $\alpha\in(0,1]^l$ and $L\in(\mathbb{R}_+^*)^l$, the estimator $\widehat{s}$ of Theorem 2
satisfies

$$C\,\mathbb{E}_s\big[d^2(s,\widehat{s})\big] \le \sum_{j=1}^{l}\inf_{T\in\mathbb{T}_j}\Big\{lL_j^2\,d^{2\alpha_j}(u_j,T) + \tau\big[\Delta_{\lambda_j}(T) + \mathcal{D}(T)\,\mathcal{L}_{j,T}\big]\Big\}$$

$$\qquad\qquad + d^2(s,g\circ u) + \inf_{F\in\mathbb{F}}\Big\{d_\infty^2(g,F) + \tau\big[\Delta_\gamma(F) + \mathcal{D}(F)\big]\Big\}, \qquad (3.6)$$

where $\mathcal{L}_{j,T}$ is defined by (3.4) and bounded by (3.5).

3.4 Mixing collections corresponding to different values of $l$

If it is known that $s$ takes the special form $g\circ u$ for some unknown values of $g\in\mathcal{F}_{l,c}$
and $u\in\mathcal{T}^l$, or if $s$ is very close to some function of this form, the previous approach is
quite satisfactory. If we do not have such information, we may apply the previous
construction with several values of $l$ simultaneously, approximating $s$ by different
combinations $g_l\circ u_l$ with $u_l$ taking its values in $[-1,1]^l$, $g_l$ a function on $[-1,1]^l$ and
$l$ varying among some subset $I$ of $\mathbb{N}^*$. To each value of $l$ we associate, as before, $l+1$
collections of models and the corresponding subprobabilities, each $l$ then leading to an
estimator $\widehat{s}_l$ the risk of which is bounded by $\mathcal{R}(\widehat{s}_l,g_l,u_l)$ given by the right-hand side
of (3.2). The model selection approach allows us to use all the previous collections of
models for all values of $l$ simultaneously in order to build a new estimator the risk of
which is approximately as good as the risk of the best of the $\widehat{s}_l$. More generally, let
us assume that we have at hand a countable family $\{\mathfrak{S}_\ell,\ \ell\in I\}$ of sets $\mathfrak{S}_\ell$ of the form
(3.1) satisfying Assumption 1 for some $l = l(\ell)\ge 1$. To each such set, Theorem 2
associates an estimator $\widehat{s}_\ell$ with a risk bounded by

$$C\,\mathbb{E}_s\big[d^2(s,\widehat{s}_\ell)\big] \le \inf_{(g,u)}\mathcal{R}(\widehat{s}_\ell,g,u),$$

where $\mathcal{R}(\widehat{s}_\ell,g,u)$ denotes the right-hand side of (3.2) when $\mathfrak{S} = \mathfrak{S}_\ell$ and the infimum
runs among all pairs $(g,u)$ with $g\in\mathcal{F}_{l(\ell),c}$ and $u\in\mathcal{T}^{l(\ell)}$. We can then prove (in
Section 5.5 below) the following result.

Theorem 3 Assume that Theorem 1 holds and let $I$ be a countable set and $\nu$ a
subprobability on $I$. For each $\ell\in I$ we are given a set $\mathfrak{S}_\ell$ of the form (3.1) that
satisfies Assumption 1 with $l = l(\ell)$ and a corresponding estimator $\widehat{s}_\ell$ provided by
Theorem 2. One can then design a new estimator $\widehat{s} = \widehat{s}(X)$ satisfying

$$C\,\mathbb{E}_s\big[d^2(s,\widehat{s})\big] \le \inf_{\ell\in I}\,\inf_{(g,u)}\big\{\mathcal{R}(\widehat{s}_\ell,g,u) + \tau\Delta_\nu(\ell)\big\},$$

where $\mathcal{R}(\widehat{s}_\ell,g,u)$ denotes the right-hand side of (3.2) when $\mathfrak{S} = \mathfrak{S}_\ell$ and the second
infimum runs among all pairs $(g,u)$ with $g\in\mathcal{F}_{l(\ell),c}$ and $u\in\mathcal{T}^{l(\ell)}$.

3.5 The main ideas underlying our construction 

Let us assume here that $p = q = 2$ and $E = [-1,1]^k$ with $k\ge l\ge 1$. Our approach
is based on the construction of a family of linear spaces with good approximation
properties with respect to composite functions $g\circ u$. More precisely, if one considers a
finite dimensional linear space $F\subset\mathcal{F}_{l,\infty}$ for approximating $g$ and compact sets $T_j\subset\mathcal{T}$
for approximating the $u_j$, we shall show (see Proposition 4 in Section 5.1 below) that
there exists some $t$ in $T = \prod_{j=1}^{l} T_j$ such that the linear space $S_t = \{f\circ t \mid f\in F\}$
approximates the composite function $g\circ u$ with an error bound

$$d(g\circ u,S_t) \le d_\infty(g,F) + \sqrt{2}\sum_{j=1}^{l} w_{g,j}\big(d(u_j,T_j)\big). \qquad (3.7)$$

The case where the function $g$ is Lipschitz, i.e. $w_{g,j}(x) = Lx$ for all $j$, is of particular
interest since, up to constants, the error bound we get is the sum of those for
approximating separately $g$ by $F$ (with respect to the $\mathbb{L}_\infty$-distance) and the $u_j$ by
$T_j$. In particular, if $s$ were exactly of the form $s = g\circ u$ for some known functions
$u_j$, we could use a linear space $F$ of piecewise constant functions with dimension of
order $D$ to approximate $g$ and take $T_j = \{u_j\}$ for all $j$. In this case the linear space
$S_u$ whose dimension is also of order $D$ would approximate $s = g\circ u$ with an error
bounded by $D^{-1/l}$. Note that if the $u_j$ were all $(\beta,L)$-Hölderian with $\beta\in(0,1]^k$, the
overall regularity of the function $s = g\circ u$ could not be expected to be better than
$\beta$-Hölderian, since this regularity is already achieved by taking $g(y_1,\ldots,y_l) = y_1$. In
comparison, an approach based on the overall smoothness of $s$, which would completely
ignore the fact that $s = g\circ u$ and the knowledge of the $u_j$, would lead to an
approximation bound of order $D^{-\overline{\beta}/k}$ with $\overline{\beta} = k\big(\sum_{j=1}^{k}\beta_j^{-1}\big)^{-1}$. The former bound,
$D^{-1/l}$, based on the structural assumption that $s = g\circ u$ therefore improves on the
latter since $\overline{\beta}\le 1$ and $k\ge l$. Of course, one could argue that the former approach
uses the knowledge of the $u_j$, which is quite a strong assumption. Actually, a more
reasonable approach would be to assume that $u$ is unknown but close to a parametric
set $\overline{T}$, in which case, it would be natural to replace the single model $S_u$ used for
approximating $s$, by the family of models $\mathbb{S}_{\overline{T}}(F) = \{S_t \mid t\in\overline{T}\}$ and, ideally, let the
usual model selection techniques select some best linear space among it. Unfortunately,
results such as Theorem 1 do not apply to this case, since the family $\mathbb{S}_{\overline{T}}(F)$
has the same cardinality as $\overline{T}$ and is therefore typically not countable. The main
idea of our approach is to take advantage of the fact that the $u_j$ take their values in
$[-1,1]$ so that we can embed $\overline{T}$ into a compact subset of $\mathcal{T}^l$. We may then introduce
a suitably discretized version $T$ of $\overline{T}$ (more precisely, of its embedding) and replace
the ideal collection $\mathbb{S}_{\overline{T}}(F)$ by $\mathbb{S}_{T}(F)$, for which similar approximation properties can
be proved. The details of this discretization device will be given in the proofs of our
main results. Finally, we shall let both $\overline{T}$ and $F$ vary into some collections of models
and use all the models of the various resulting collections $\mathbb{S}_{T}(F)$ together in order to
estimate $s$ at best.
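
The comparison between the two approximation bounds discussed in this section can be made explicit by computing the exponents of $D$. In the sketch below, the structural approach (Lipschitz $g$ on $[-1,1]^l$, known $u_j$) gives an error of order $D^{-1/l}$, while the approach based on the overall anisotropic smoothness $\beta\in(0,1]^k$ gives $D^{-\overline{\beta}/k}$ with $\overline{\beta}$ the harmonic mean of the $\beta_j$. The numerical values of $\beta$ and $l$ are hypothetical.

```python
def harmonic_mean(values):
    """k / (sum of 1/v), the average smoothness used throughout the paper."""
    return len(values) / sum(1.0 / v for v in values)

def approximation_exponents(betas, l):
    """Exponents a for which the error is of order D**(-a): (structural, smoothness only)."""
    k = len(betas)
    return 1.0 / l, harmonic_mean(betas) / k

# Hypothetical example with k = 4 coordinates and l = 2 components.
structural, smoothness_only = approximation_exponents([0.5, 1.0, 1.0, 0.25], l=2)
```

Here the structural exponent is 0.5 and the smoothness-only exponent is 0.125, so the same dimension $D$ buys a much smaller approximation error under the structural assumption.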

4 Applications 

The aim of this section is to provide various applications of Theorem 2 and its
corollaries. We start with a brief overview of more or less classical collections of models
commonly used for approximating smooth (and less smooth) functions on $[-1,1]^k$.

4.1 Classical models for approximating smooth functions

Along this section, $d$ denotes the $\mathbb{L}_2$-distance in $\mathbb{L}_2([-1,1]^k,2^{-k}dx)$, thus taking $q = 2$,
$E = [-1,1]^k$ and $\mu$ the Lebesgue probability on $E$. Collections of models with the
following property will be of special interest throughout this paper.

Assumption 2 For each $D\in\mathbb{N}$ the number of elements with dimension $D$ belonging
to the collection $\mathbb{S}$ is bounded by $\exp[c(\mathbb{S})(D+1)]$ for some nonnegative constant $c(\mathbb{S})$
depending on $\mathbb{S}$ only.

4.1.1 Approximating functions in Hölder spaces on $[-1,1]^k$



When $k = 1$, a typical smoothness condition for a function $s$ on $[-1,1]$ is that it
belongs to some Hölder space $\mathcal{H}^{\alpha}([-1,1])$ with $\alpha = r+\alpha'$, $r\in\mathbb{N}$ and $0 < \alpha'\le 1$,
which is the set of all functions $f$ on $[-1,1]$ with a continuous derivative of order $r$
satisfying, for some constant $L(f) > 0$,

$$\big|f^{(r)}(x) - f^{(r)}(y)\big| \le L(f)\,|x-y|^{\alpha'} \quad\text{for all } x,y\in[-1,1].$$

This notion of smoothness extends to functions $f(x_1,\ldots,x_k)$ defined on $[-1,1]^k$, by
saying that $f$ belongs to $\mathcal{H}^{\alpha}([-1,1]^k)$ with $\alpha = (\alpha_1,\ldots,\alpha_k)\in(0,+\infty)^k$ if, viewed as
a function of $x_i$ only, it belongs to $\mathcal{H}^{\alpha_i}([-1,1])$ for $1\le i\le k$ with some constant $L(f)$
independent of both $i$ and the variables $x_j$ for $j\ne i$. The smoothness of a function $s$
in $\mathcal{H}^{\alpha}([-1,1]^k)$ is said to be isotropic if the $\alpha_i$ are all equal and anisotropic otherwise,
in which case the quantity $\overline{\alpha}$ given by

$$\frac{1}{\overline{\alpha}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{\alpha_i}$$

corresponds to the average smoothness of $s$. It follows from results in Approximation
Theory that functions in the Hölder space $\mathcal{H}^{\alpha}([-1,1]^k)$ can be well approximated
by piecewise polynomials on $k$-dimensional hyperrectangles. More precisely, our next
proposition follows from results in Dahmen, DeVore and Scherer (1980).

Proposition 1 Let $(k,r)\in\mathbb{N}^*\times\mathbb{N}$. There exists a collection of models
$\mathbb{H}_{k,r} = \bigcup_{D\ge 1}\mathbb{H}_{k,r}(D)$ satisfying Assumption 2 such that for each positive integer $D$, the
family $\mathbb{H}_{k,r}(D)$ consists of linear spaces $S$ with dimensions $\mathcal{D}(S)\le C_1'(k,r)D$ spanned
by piecewise polynomials of degree at most $r$ on $k$-dimensional hyperrectangles and
for which

$$\inf_{S\in\mathbb{H}_{k,r}(D)} d(s,S) \le \inf_{S\in\mathbb{H}_{k,r}(D)} d_\infty(s,S) \le C_2'(k,r)\,L(s)\,D^{-\overline{\alpha}/k}$$

for all $s\in\mathcal{H}^{\alpha}([-1,1]^k)$ with $\sup_{1\le i\le k}\alpha_i\le r+1$.



4.1.2 Approximating functions in anisotropic Besov spaces 

Anisotropic Besov spaces generalize anisotropic Hölder spaces and are defined in a
similar way by using directional moduli of smoothness, just as Hölder spaces are
defined using directional derivatives. To be short, a function belongs to an anisotropic
Besov space on $[-1,1]^k$ if, when all coordinates are fixed apart from one, it belongs
to a Besov space on $[-1,1]$. A precise definition (restricted to $k = 2$ but which
can be generalized easily) can be found in Hochmuth (2002). The general definition
together with useful approximation properties by piecewise polynomials can be
found in Akakpo (2012). For $0 < p\le+\infty$, $k\ge 1$ and $\beta\in(0,+\infty)^k$, let us
denote by $\mathcal{B}^{\beta}_{p,p}([-1,1]^k)$ the anisotropic Besov spaces. In particular,
$\mathcal{B}^{\beta}_{\infty,\infty}([-1,1]^k) = \mathcal{H}^{\beta}([-1,1]^k)$. It follows from Akakpo (2012) that Proposition 1 can be generalized
to Besov spaces in the following way.

Proposition 2 Let $p > 0$, $k\in\mathbb{N}^*$ and $r\in\mathbb{N}$. There exists a collection of models
$\mathbb{B}_{k,r} = \bigcup_{D\ge 1}\mathbb{B}_{k,r}(D)$ satisfying Assumption 2 such that for each positive integer $D$,
$\mathbb{B}_{k,r}(D)$ consists of linear spaces $S$ with dimensions $\mathcal{D}(S)\le C_1'(k,r)D$ spanned by
piecewise polynomials of degree at most $r$ on $k$-dimensional hyperrectangles and for
which

$$\inf_{S\in\mathbb{B}_{k,r}(D)} d(s,S) \le C_2'(k,r,p)\,|s|_{\beta,p,p}\,D^{-\overline{\beta}/k}$$

for all $s\in\mathcal{B}^{\beta}_{p,p}([-1,1]^k)$ with semi-norm $|s|_{\beta,p,p}$ and $\beta$ satisfying

$$\sup_{1\le i\le k}\beta_i < r+1 \quad\text{and}\quad \overline{\beta} > k\big[(p^{-1}-2^{-1})\vee 0\big]. \qquad (4.1)$$

4.2 Estimation of smooth functions on $[-1,1]^k$

In this section, our aim is to establish risk bounds for our estimator $\widehat{s}$ when $s = g\circ u$
for some smooth functions $g$ and $u$. We shall discuss the improvement, in terms of
rates of convergence as $\tau$ tends to 0, when assuming such a structural hypothesis, as
compared to a pure smoothness assumption on $s$. Throughout this section, we take
$q = 2$, $E = [-1,1]^k$ and $d$ as the $\mathbb{L}_2$-distance on $\mathbb{L}_2(E,2^{-k}dx)$.

It follows from Section 4.1 that, for all $r\ge 0$, $\mathbb{H}_{k,r}$ satisfies Assumption 2 for some
constant $c(\mathbb{H}_{k,r})$. Therefore the measure $\gamma$ on $\mathbb{H}_{k,r}$ defined by

$$\Delta_\gamma(S) = \big(c(\mathbb{H}_{k,r})+1\big)(D+1) \quad\text{for all } S\in\mathbb{H}_{k,r}(D)\setminus\bigcup_{1\le D'<D}\mathbb{H}_{k,r}(D') \qquad (4.2)$$

is a subprobability since

$$\sum_{S\in\mathbb{H}_{k,r}} e^{-\Delta_\gamma(S)} \le \sum_{D\ge 1} e^{-D}\,|\mathbb{H}_{k,r}(D)|\,e^{-c(\mathbb{H}_{k,r})(D+1)} \le \sum_{D\ge 1} e^{-D} < 1.$$

We shall similarly consider the subprobability $\lambda$ defined on $\mathbb{B}_{k,r}$ by

$$\Delta_\lambda(S) = \big(c(\mathbb{B}_{k,r})+1\big)(D+1) \quad\text{for all } S\in\mathbb{B}_{k,r}(D)\setminus\bigcup_{1\le D'<D}\mathbb{B}_{k,r}(D'). \qquad (4.3)$$

Finally, for $g\in\mathcal{H}^{\alpha}([-1,1]^l) = \mathcal{B}^{\alpha}_{\infty,\infty}([-1,1]^l)$ with $\alpha\in(\mathbb{R}_+^*)^l$, we set
$\|g\|_{\alpha,\infty} = |g|_{\alpha,\infty,\infty} + \inf L'$ where the infimum runs among all numbers $L'$ for which
$w_{g,j}(z)\le L'z^{\alpha_j\wedge 1}$ for all $z\in[0,2]$ and $j = 1,\ldots,l$.
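
The subprobability property of the weights defined in this section rests on the geometric series $\sum_{D\ge 1}e^{-D} = 1/(e-1) < 1$. The sketch below checks this on a toy collection in which the number of models indexed by $D$ is as large as Assumption 2 permits; the constant `c` and the truncation at $D < 60$ are hypothetical choices.

```python
import math

def total_mass(counts, c):
    """Sum of exp(-(c + 1)(D + 1)) over all models; counts[D] is the number of models indexed by D."""
    return sum(n * math.exp(-(c + 1.0) * (D + 1)) for D, n in counts.items())

c = 0.7  # hypothetical value of the constant appearing in Assumption 2
counts = {D: math.floor(math.exp(c * (D + 1))) for D in range(1, 60)}
mass = total_mass(counts, c)
geometric_limit = 1.0 / (math.e - 1.0)  # value of the series sum over D >= 1 of exp(-D)
```

The total mass stays below the geometric limit, which is itself smaller than 1, so the weights indeed define a subprobability in this toy case.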



4.2.1 Convergence rates using composite functions 

Let us consider here the set $\mathcal{S}_{k,l}(\alpha,\beta,p,L,R)$ gathering the composite functions
$g\circ u$ with $g\in\mathcal{H}^{\alpha}([-1,1]^l)$ satisfying $\|g\|_{\alpha,\infty}\le L$ and $u_j\in\mathcal{B}^{\beta_j}_{p_j,p_j}([-1,1]^k)$ with semi-norms
$|u_j|_{\beta_j,p_j,p_j}\le R_j$ for all $j = 1,\ldots,l$. The following result holds.

Theorem 4 Assume that Theorem 1 holds with $q = 2$. There exists an estimator $\widehat{s}$
such that, for all $l\ge 1$, $\alpha,R\in(\mathbb{R}_+^*)^l$, $L > 0$, $\beta_1,\ldots,\beta_l\in(\mathbb{R}_+^*)^k$ and $p\in(0,+\infty]^l$
with $\overline{\beta}_j > k\big[(p_j^{-1}-2^{-1})\vee 0\big]$ for $1\le j\le l$,

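$$\sup_{s\in\mathcal{S}_{k,l}(\alpha,\beta,p,L,R)} C'\,\mathbb{E}_s\big[d^2(s,\widehat{s})\big] \le \sum_{j=1}^{l}\Big(LR_j^{\alpha_j\wedge 1}\Big)^{\frac{2k}{2\overline{\beta}_j(\alpha_j\wedge 1)+k}}\,[\tau\mathcal{L}]^{\frac{2\overline{\beta}_j(\alpha_j\wedge 1)}{2\overline{\beta}_j(\alpha_j\wedge 1)+k}} + L^{\frac{2l}{2\overline{\alpha}+l}}\,\tau^{\frac{2\overline{\alpha}}{2\overline{\alpha}+l}} + \tau\mathcal{L},$$

where $\mathcal{L} = \log(\tau^{-1})\vee\log(L^2)\vee 1$ and $C'$ depends on $k$, $l$, $\alpha$, $\beta$ and $p$.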

Let us recall that we need not assume that $s$ is exactly of the form $g\circ u$ but rather, as
we did before, that $s$ can be approximated by a function $\overline{s} = g\circ u\in\mathcal{S}_{k,l}(\alpha,\beta,p,L,R)$.
In such a case we simply get an additional bias term of the form $d^2(s,\overline{s})$ in our risk
bounds.

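Proof: Let us fix some value of $l\ge 1$ and take $s = g\circ u\in\mathcal{S}_{k,l}(\alpha,\beta,p,L,R)$ and
define

$$r = r(\alpha,\beta) = 1 + \Big\lfloor\max_{i=1,\ldots,l}\alpha_i\ \vee\max_{j=1,\ldots,l;\ i=1,\ldots,k}\beta_{j,i}\Big\rfloor.$$

The regularity properties of $g$ and the $u_j$ together with Propositions 1 and 2 imply
that for all $D\ge 1$, there exist $F\in\mathbb{H}_{l,r}(D)$ and sets $T_j\in\mathbb{B}_{k,r}(D)$ for $j = 1,\ldots,l$ such
that

$$\mathcal{D}(F)\le C_1'(l,\alpha,\beta)\,D; \qquad d_\infty(g,F)\le C_2'(l,\alpha,\beta)\,L\,D^{-\overline{\alpha}/l};$$

and, for $1\le j\le l$,

$$\mathcal{D}(T_j)\le C_3'(k,\alpha,\beta_j,p_j)\,D; \qquad d(u_j,T_j)\le C_4'(k,\alpha,\beta_j,p_j)\,R_j\,D^{-\overline{\beta}_j/k}.$$

Since the collections $\mathbb{H}_{l,r}$ and $\mathbb{B}_{k,r}$ satisfy Assumption 2 and $w_{g,j}(z)\le Lz^{\alpha_j\wedge 1}$ for all
$j$ and $z\in[0,2]$, we may apply Corollary 1 with

$$\mathfrak{S}_{l,r} = (l,\mathbb{H}_{l,r},\gamma_{l,r},\mathbb{B}_{k,r},\ldots,\mathbb{B}_{k,r},\lambda_{k,r},\ldots,\lambda_{k,r}),$$

the subprobabilities $\gamma_{l,r}$ and $\lambda_{k,r}$ being given by (4.2) and (4.3) respectively. Besides,
it follows from (3.5) that $\mathcal{L}_{j,T}\le C'(l,\alpha)\mathcal{L}$ for all $j$, so that (3.6) implies that the risk
of the resulting estimator $\widehat{s}_{l,r}$ is bounded from above by

$$C'\left(\sum_{j=1}^{l}\inf_{D\ge 1}\Big[L^2R_j^{2(\alpha_j\wedge 1)}D^{-2(\alpha_j\wedge 1)\overline{\beta}_j/k} + D\tau\mathcal{L}\Big] + \inf_{D\ge 1}\Big[L^2D^{-2\overline{\alpha}/l} + D\tau\Big]\right)$$

for some constant $C'$ depending on $l,k,\alpha,\beta_1,\ldots,\beta_l$. We obtain the result by
optimizing each term of the sum with respect to $D$ by means of Lemma 1, and by using
Theorem 3 with $\nu$ defined for $\ell = (l,r)\in\mathbb{N}^*\times\mathbb{N}$ by $\nu(l,r) = e^{-(l+r+1)}$ for which
$\Delta_\nu(l,r)\,\tau\le(l+r+1)\,\mathcal{R}(\widehat{s}_{l,r},g,u)$ for all $l,r$. □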
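
The optimization over $D$ invoked at the end of this proof concerns quantities of the form $\inf_{D\ge 1}\{aD^{-\theta}+bD\}$. Lemma 1 is not reproduced in this section, so the sketch below relies on an elementary bound that may differ from the lemma's exact statement: taking $D = \lceil(a/b)^{1/(1+\theta)}\rceil$ gives $aD^{-\theta}+bD\le b+2a^{1/(1+\theta)}b^{\theta/(1+\theta)}$. The code finds the exact integer minimizer using convexity in $D$ and compares it with that bound; the values of `a`, `b` and `theta` are hypothetical.

```python
import math

def tradeoff(a, b, theta, D):
    """Value of a * D**(-theta) + b * D."""
    return a * D ** (-theta) + b * D

def best_integer_tradeoff(a, b, theta):
    """Minimize the trade-off over integers D >= 1, using convexity of D -> a D^-theta + b D."""
    d_star = (a * theta / b) ** (1.0 / (1.0 + theta))  # continuous minimizer
    candidates = {max(1, math.floor(d_star)), max(1, math.ceil(d_star))}
    D = min(candidates, key=lambda d: tradeoff(a, b, theta, d))
    return D, tradeoff(a, b, theta, D)

def elementary_bound(a, b, theta):
    """b + 2 a^{1/(1+theta)} b^{theta/(1+theta)}, from the choice D = ceil((a/b)^{1/(1+theta)})."""
    return b + 2.0 * a ** (1.0 / (1.0 + theta)) * b ** (theta / (1.0 + theta))

# Hypothetical inputs: a plays the role of L^2 R^2, b of tau * L, theta of 2 * beta_bar / k.
D_opt, value = best_integer_tradeoff(a=1.0, b=1e-3, theta=2.0)
bound = elementary_bound(a=1.0, b=1e-3, theta=2.0)
```

The resulting value is of order $a^{1/(1+\theta)}b^{\theta/(1+\theta)}$, which is how exponents of the form $2\overline{\beta}/(2\overline{\beta}+k)$ arise from the approximation rate $D^{-\overline{\beta}/k}$.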



4.2.2 Structural assumption versus smoothness assumption 

In view of discussing the interest of the risk bounds provided by Theorem 4, let us
focus here, for simplicity, on the case where $g\in\mathcal{H}^{\alpha}([-1,1])$ with $\alpha > 0$ (hence $l = 1$)
and $u$ is a function from $E = [-1,1]^k$ to $[-1,1]$ that belongs to $\mathcal{H}^{\beta}([-1,1]^k)$ with
$\beta\in(\mathbb{R}_+^*)^k$. The following proposition is to be proved in Section 5.7.

Proposition 3 Let $\phi$ be the function defined on $(\mathbb{R}_+^*)^2$ by

$$\phi(x,y) = \begin{cases} xy & \text{if } x\vee y\le 1,\\ x\wedge y & \text{otherwise.}\end{cases}$$

For all $k\ge 1$, $\alpha > 0$, $\beta\in(\mathbb{R}_+^*)^k$, $g\in\mathcal{H}^{\alpha}([-1,1])$ and $u\in\mathcal{H}^{\beta}([-1,1]^k)$,

$$g\circ u\in\mathcal{H}^{\theta}([-1,1]^k) \quad\text{with } \theta_i = \phi(\beta_i,\alpha) \text{ for } 1\le i\le k. \qquad (4.4)$$

Moreover, $\theta$ is the largest possible value for which (4.4) holds for all $g\in\mathcal{H}^{\alpha}([-1,1])$
and $u\in\mathcal{H}^{\beta}([-1,1]^k)$ since, whatever $\theta'\in(\mathbb{R}_+^*)^k$ such that $\theta_i' > \theta_i$ for some
$i\in\{1,\ldots,k\}$, there exist some $g\in\mathcal{H}^{\alpha}([-1,1])$ and $u\in\mathcal{H}^{\beta}([-1,1]^k)$ such that
$g\circ u\notin\mathcal{H}^{\theta'}([-1,1]^k)$.

Using the information that $s$ belongs to $\mathcal{H}^{\theta}([-1,1]^k)$ with $\theta$ given by (4.4) and that
we cannot assume that $s$ belongs to some smoother class (although this may happen
in special cases) since $\theta$ is the largest index for which (4.4) holds, but ignoring the
fact that $s = g\circ u$, we can estimate $s$ at rate $\tau^{2\overline{\theta}/(2\overline{\theta}+k)}$ (as $\tau$ tends to 0) while, on the
other hand, by using Theorem 4 and the structural information that $s = g\circ u$, we can
achieve the rate

$$\tau^{2\alpha/(2\alpha+1)} + \big[\tau\log\tau^{-1}\big]^{2\overline{\beta}(\alpha\wedge 1)/(2\overline{\beta}(\alpha\wedge 1)+k)}.$$

Let us now compare these two rates. First note that it follows from (4.4) that $\theta_i\le\alpha$
for all $i$, hence $\overline{\theta}\le\alpha$ and, since $k\ge 1$, $2\alpha/(2\alpha+1)\ge 2\overline{\theta}/(2\overline{\theta}+k)$. Therefore the
term $\tau^{2\alpha/(2\alpha+1)}$ always improves over $\tau^{2\overline{\theta}/(2\overline{\theta}+k)}$ when $\tau$ is small and, to compare the
two rates, it is enough to compare $\overline{\theta}$ with $\overline{\beta}(\alpha\wedge 1)$. To do so, we use the following
lemma (to be proved in Section 5.8).

Lemma 2 For all $\alpha > 0$ and $\beta\in(\mathbb{R}_+^*)^k$, the smoothness index

$$\theta = \big(\phi(\alpha,\beta_1),\ldots,\phi(\alpha,\beta_k)\big)$$

satisfies $\overline{\theta}\le\overline{\beta}(\alpha\wedge 1)$ and equality holds if and only if $\sup_{1\le i\le k}\beta_i\le\alpha\vee 1$.

When $\sup_{1\le i\le k}\beta_i\le\alpha\vee 1$, our special strategy does not bring any improvement as
compared to the standard one; it even slightly deteriorates the risk bound because of
the extra $\log\tau^{-1}$ factor. On the contrary, if $\sup_{1\le i\le k}\beta_i > \alpha\vee 1$, our new strategy
improves over the classical one and this improvement can be substantial if $\overline{\beta}$ is much
larger than $\alpha\vee 1$. If, for instance, $\alpha = 1$ and $\overline{\beta} = k = \beta_j$ for all $j$, we get a bound of
order $[\tau(\log\tau^{-1})]^{2/3}$ which, apart from the extra $\log\tau^{-1}$ factor, corresponds to the
minimax rate of estimation of a Lipschitz function on $[-1,1]$, instead of the risk bound
$\tau^{2/(2+k)}$ that we would get if we estimated $s$ as a Lipschitz function on $[-1,1]^k$. When
our strategy does not improve over the classical one, i.e. when $\sup_{1\le i\le k}\beta_i\le\alpha\vee 1$, the
additional loss due to the extra logarithmic factor in our risk bound can be avoided by
mixing the models used for the classical strategy with the models used for designing
our estimator, following the recipe of Section 3.4.
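
The comparison of the two rates can be reproduced numerically. The sketch below, written for the case $l = 1$ and ignoring logarithmic factors, computes the smoothness index of the composite function from the rule $xy$ if $x\vee y\le 1$ and $x\wedge y$ otherwise, then returns the exponent of $\tau$ under the pure smoothness assumption and under the structural assumption. Function names and test values are illustrative assumptions.

```python
def phi(x, y):
    """Smoothness index rule for composite functions: x*y if max(x, y) <= 1, else min(x, y)."""
    return x * y if max(x, y) <= 1 else min(x, y)

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

def rate_exponents(alpha, betas):
    """Exponents of tau (log factors ignored): (smoothness only, structural)."""
    k = len(betas)
    theta_bar = harmonic_mean([phi(b, alpha) for b in betas])
    beta_bar = harmonic_mean(betas)
    a1 = min(alpha, 1.0)
    smooth = 2.0 * theta_bar / (2.0 * theta_bar + k)
    structural = min(2.0 * alpha / (2.0 * alpha + 1.0),
                     2.0 * beta_bar * a1 / (2.0 * beta_bar * a1 + k))
    return smooth, structural

# Example from the text: alpha = 1 and beta_j = k for all j, here with k = 4.
smooth, structural = rate_exponents(1.0, [4.0] * 4)
# A case where no improvement is expected: all beta_i <= alpha v 1.
smooth_eq, structural_eq = rate_exponents(0.5, [0.5, 0.8])
```

In the first example the exponent improves from $2/(2+k) = 1/3$ to $2/3$; in the second the two exponents coincide, so only the logarithmic factor distinguishes the two bounds.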
4.3 Generalized additive models

In this section, we assume that $E = [-1,1]^k$, $\mu$ is the Lebesgue probability on $E$ and
$q = 2$. A special structure that has often been considered in regression corresponds
to functions $s = g\circ u$ with

$$u(x_1,\ldots,x_k) = u_1(x_1)+\ldots+u_k(x_k); \qquad s(x) = g\big(u_1(x_1)+\ldots+u_k(x_k)\big), \qquad (4.5)$$

where the $u_j$ take their values in $[-1/k,1/k]$ for all $j = 1,\ldots,k$. Such a model
has been considered in Horowitz and Mammen (2007) and while their approach is
non-adaptive, ours, based on Theorem 2 and a suitable choice of the collections of
models, allows us to derive a fully adaptive estimator with respect to the regularities
of $g$ and the $u_j$. More precisely, for $r\in\mathbb{N}$, let $\mathbb{T}_r^{(k)}$ be the collection of all models
of the form $T = T_1+\ldots+T_k$ where for $j = 1,\ldots,k$, $T_j$ is the set of functions of
the form $x\mapsto t_j(x_j)$ with $x\in E$ and $t_j$ ranging over an element of $\mathbb{B}_{1,r}$. Using $\lambda_r = \lambda$ as defined by (4.3),
we endow $\mathbb{T}_r^{(k)}$ with the subprobability $\lambda_r^{(k)}$ defined for $T\in\mathbb{T}_r^{(k)}$ by the infimum of the
quantities $\prod_{i=1}^{k}\lambda_r(T_i)$ when $(T_1,\ldots,T_k)$ runs among all the $k$-tuples of $\mathbb{B}_{1,r}$ satisfying
$T = T_1+\ldots+T_k$. Finally, for $\alpha,L > 0$, $\beta,R\in(\mathbb{R}_+^*)^k$ and $p = (p_1,\ldots,p_k)\in(\mathbb{R}_+^*)^k$,
let $\mathcal{S}^{\mathrm{Add}}_{k}(\alpha,\beta,p,L,R)$ be the set of functions of the form (4.5) with $g\in\mathcal{H}^{\alpha}([-1,1])$
satisfying $\|g\|_{\alpha,\infty}\le L$ and $u_j\in\mathcal{B}^{\beta_j}_{p_j,p_j}([-1,1])$ with $|u_j|_{\beta_j,p_j,p_j}\le R_jk^{-1}$ for all
$j = 1,\ldots,k$. Using the sets $\mathfrak{S}_r = (1,\mathbb{H}_{1,r},\gamma_r,\mathbb{T}_r^{(k)},\lambda_r^{(k)})$ with $r\in\mathbb{N}$ we can build an
estimator with the following property.

Theorem 5 Assume that Theorem 1 holds with $q = 2$. There exists an estimator $\widehat{s}$
which satisfies for all $\alpha,L > 0$, $p,R\in(\mathbb{R}_+^*)^k$ and $\beta\in(\mathbb{R}_+^*)^k$ with $\beta_j > 1/p_j-1/2$
for all $j = 1,\ldots,k$,

sup C'¥.s[d^{s,s)\ 
i=i 

where $\mathcal{L} = \log(\tau^{-1})\vee\log(L^2)\vee 1$ and $C'$ is a constant that depends on $\alpha$, $\beta$, $p$ and
$k$ only.

If one is mainly interested in the rate of convergence as $\tau$ tends to 0, the bound we get
is of order $\max\big\{\tau^{2\alpha/(2\alpha+1)},\ [\tau\log(\tau^{-1})]^{2(\alpha\wedge 1)\beta/(2(\alpha\wedge 1)\beta+1)}\big\}$ where $\beta = \min\{\beta_1,\ldots,\beta_k\}$.
In particular, if $\alpha\ge 1$, this rate is the same (up to a logarithmic factor) as that we
would obtain for estimating a function on $[-1,1]$ with the smallest regularity among
$\alpha,\beta_1,\ldots,\beta_k$.

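Proof: Let us consider some $s = g\circ u\in\mathcal{S}^{\mathrm{Add}}_{k}(\alpha,\beta,p,L,R)$ and
$r = 1+\lfloor\alpha\vee\beta_1\vee\ldots\vee\beta_k\rfloor$. For all $D,D_1,\ldots,D_k\ge 1$, there exist $F\in\mathbb{H}_{1,r}(D)$ and $T_j\in\mathbb{B}_{1,r}(D_j)$ for
all $j = 1,\ldots,k$ such that

$$\mathcal{D}(F)\le C_1'(r)\,D; \qquad d_\infty(g,F)\le C_2'(r)\,L\,D^{-\alpha}$$

and, for $1\le j\le k$,

$$\mathcal{D}(T_j)\le C_3'(k,r,p)\,D_j; \qquad d(u_j,T_j)\le C_4'(k,r,p)\,R_jk^{-1}D_j^{-\beta_j}.$$

If $T = T_1+\ldots+T_k$, then $\mathcal{D}(T)\le\sum_{j=1}^{k}\mathcal{D}(T_j)$ and
$\Delta_{\lambda_r^{(k)}}(T)\le\sum_{j=1}^{k}\Delta_{\lambda_r}(T_j)\le\big(c(\mathbb{B}_{1,r})+1\big)\sum_{j=1}^{k}(D_j+1)$.
Moreover, $d(u,T)\le\sum_{j=1}^{k}d(u_j,T_j)\le C_4'k^{-1}\sum_{j=1}^{k}R_jD_j^{-\beta_j}$, hence,
$d^2(u,T)\le(C_4')^2k^{-1}\sum_{j=1}^{k}R_j^2D_j^{-2\beta_j}$ and finally,

For all $T$, $\mathcal{L}_{1,T}\le C'(\alpha)\mathcal{L}$ and since $w_g(z)\le Lz^{\alpha\wedge 1}$ for all $z\in[0,2]$, we may apply
Corollary 1 with $l = 1$ and get that the risk of the resulting estimator satisfies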

k 

j=l ^-^ ^-^ 

We conclude by arguing as in the proof of Theorem 4. □

4.4 Multiple index models and artificial neural networks 

In this section, we assume that E = [—1, l]'^, q = 2 and d is the distance in h2{E,fi) 
where fj, is the Lebesgue probability on E. We denote by | • |i and | • |oo respectively 
the £i- and ^co-norms in M'^ and Ck the unit ball for the £i-norm. As we noticed 
earlier, when s is an arbitrary function on E and k is large, there is no hope to get a 
nice estimator for s without some additional assumptions. A very simple one is that 
s{x) can be written as g{{9,x)) for some 6 G Ck, which corresponds to the so-called 
single index model. More generally, we may pretend that s can be well approximated 
by some function s of the form 

= g {{9i,x),. . . , {01, x)) 

where 9i, . . . ,6i are I elements of Ck and g maps [-1,1]' toM, / being possibly unknown 
and larger than k. When s = (7 o n is of this form, the coordinate functions Uj{-) = 
{9j, ■) , for 1 < j < Z, belong to the set Tq C T of functions on E of the form x 1— )■ {9, x) 
with 9 € Ck, which is a subset of a fc-dimensional linear subspace of L2(i?, hence 
^(^o) < k. A slight generalization of this situation leads to the following result. 

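Theorem 6 Assume that Theorem 1 holds with $q = 2$. For $j \ge 1$, let $T_j$ be a subset
of $\mathcal{T}$ with finite dimension $k_j$ and for $I \subset \mathbb{N}^*$ and $l \in I$, let
$\mathbb{F}_l$ be a collection of models satisfying Assumption 1-i) and iii) for some
subprobability $\gamma_l$. There exists an estimator $\hat s$ which satisfies

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \inf_{l \in I}\ \inf_{g \in \mathcal{F}_{l,c},\, u \in T^l} \left[ d^2(s, g\circ u) + A(g,\mathbb{F}_l,\gamma_l) + \tau \sum_{j=1}^{l} k_j\, i(g,j,T_j) \right] \tag{4.6} \]

where $T^l = T_1 \times \ldots \times T_l$, $i(g,j,T_j)$ is defined by (3.3) and

\[ A(g,\mathbb{F}_l,\gamma_l) = \inf_{F \in \mathbb{F}_l} \left\{ d_\infty^2(g,F) + \tau\left[\mathcal{D}(F) + \Delta_{\gamma_l}(F)\right] \right\}. \]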
In particular, for all $l \in I$ and $(\alpha, L)$-Hölderian functions $g$ with $\alpha \in (0,1]^l$ and
$C\,\mathbb{E}_s\left[d^2(s,\hat s)\right]$




Let us comment on this result, fixing some value $l \in I$. The term $d(s, g \circ u)$
corresponds to the approximation of $s$ by functions of the form
$g(u_1(.), \ldots, u_l(.))$ with $g$ in $\mathcal{F}_{l,c}$ and $u_1, \ldots, u_l$ in
$T_1, \ldots, T_l$ respectively. As to the quantity $A(g,\mathbb{F}_l,\gamma_l)$, it corresponds
to the estimation bound for estimating the function $g$ alone if $s$ were really of the
previous form. Finally, the quantity $\tau \sum_{j=1}^{l} k_j\, i(g,j,T_j)$ corresponds to the
sum of the statistical errors for estimating the $u_j$. If for all $j$, the dimensions of
the $T_j$ remain bounded by some integer $k$ independent of $\tau$, which amounts to making a
parametric assumption on the $u_j$, and if $g$ is smooth enough, the quantity
$\tau \sum_{j=1}^{l} k_j\, i(g,j,T_j)$ is then of order $\tau \log \tau^{-1}$ for small values
of $\tau$ as seen in (4.7).

Proof of Theorem 6: For all $j$, we choose $\lambda_j$ to be the Dirac mass at $T_j$ so that
$\Delta_{\lambda_j}(T_j) = 0 = d(u_j,T_j)$. The result follows by applying Theorem 2 (for a
fixed value of $l \in I$) and then Theorem 3 with $\nu$ defined by $\nu(l) = e^{-l}$ for all
$l \in I$. □

4.4.1 The multiple index model 

As already mentioned, the multiple index model amounts to assuming that $s$ is of the
form

\[ s(x) = g\left(\langle\theta_1,x\rangle, \ldots, \langle\theta_l,x\rangle\right) \quad \text{whatever } x \in E, \]

for some known $l \ge 1$ and $k_j = k$ for all $j$. For $L > 0$ and
$\alpha \in (\mathbb{R}_+^*)^l$, let us denote by $\mathcal{S}_l^{\alpha}(L)$ the set of functions
$s$ of this form with $g \in \mathcal{H}^{\alpha}([-1,1]^l)$ satisfying
$\|g\|_{\alpha,\infty} \le L$. Applying Theorem 6 to this special case, we obtain the
following result.

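Corollary 2 Assume that Theorem 1 holds with $q = 2$ and let $I \subset \mathbb{N}^*$. There
exists an estimator $\hat s$ such that for all $l \in I$, $\alpha \in (\mathbb{R}_+^*)^l$ and $L > 0$,

\[ \sup_{s \in \mathcal{S}_l^{\alpha}(L)} C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le L^{2l/(2\overline{\alpha}+l)}\, \tau^{2\overline{\alpha}/(2\overline{\alpha}+l)} + k\tau\mathcal{L}, \]

where $\mathcal{L} = \log(\tau^{-1}) \vee \log(L^2 k^{-1}) \vee 1$ and $C$ is a constant depending
on $l$ and $\alpha$ only.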

The effect of the dimension $k$ only appears in the remaining term. The latter is
essentially proportional to $k\tau(\log(\tau^{-1}) \vee 1)$, at least for $k \ge L^2$. It is not
difficult to see that there is no hope to get a faster rate than $k\tau$ over
$\mathcal{S}_l^{\alpha}(L)$. Indeed, by taking $l = L = 1$ for simplicity and $g$ the identity
function, we see that $\mathcal{S}_1^{\alpha}(1)$ contains the unit ball of a $k$-dimensional
linear space and this is enough to assert that, at least in the regression setting, the
minimax rate is of order $k\tau$. As to the extra logarithmic factor $\log(\tau^{-1})$, we do
not know whether it is necessary or not.
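Proof: Fix $s = g \circ u \in \mathcal{S}_l^{\alpha}(L)$ and apply Theorem 6 with $T_j = T_0$
for all $j \ge 1$, $I = \{l\}$, $\mathbb{F}_l = \mathbb{H}_{l,r}$ and $\gamma_l$ defined by (4.2)
with $k = l$ and $r = \lfloor \alpha_1 \vee \ldots \vee \alpha_l \rfloor$. Arguing as in the proof
of Theorem 4, we obtain an estimator the risk of which satisfies

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le C' \left[ \inf_{D \ge 1} \left( L^2 D^{-2\overline{\alpha}/l} + D\tau \right) + k\tau\mathcal{L} \right] \le C'' \left[ L^{2l/(2\overline{\alpha}+l)}\, \tau^{2\overline{\alpha}/(2\overline{\alpha}+l)} + k\tau\mathcal{L} \right] \]

for constants $C'$ and $C''$ depending on $l$ and $\alpha$ only. Finally, we conclude as in
the proof of Theorem 4. □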



4.4.2 Case of an additive function g 

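In the multiple index model, when the value of $l$ is allowed to become large (typically
not smaller than $k$) it is often assumed that $g$ is additive, i.e. of the form

\[ g(y_1, \ldots, y_l) = g_1(y_1) + \ldots + g_l(y_l) \quad \text{for all } y \in [-1,1]^l, \tag{4.8} \]

where the $g_j$ are smooth functions from $[-1,1]$ to $\mathbb{R}$. Hereafter, we shall denote by
$\mathcal{F}^{\rm Add}_{l,c}$ the set of such additive functions $g$. The functions
$\bar s = g \circ u$ with $g \in \mathcal{F}^{\rm Add}_{l,c}$ and $u \in T_0^l$ hence take the form

\[ \bar s(x) = \sum_{j=1}^{l} g_j\left(\langle\theta_j, x\rangle\right) \quad \text{for all } x \in E. \tag{4.9} \]

For each $j = 1, \ldots, l$, let $\mathbb{F}_j$ be a countable family of finite dimensional linear
subspaces of $\mathcal{F}_{1,\infty}$ designed to approximate $g_j$ and $\gamma_j$ some
subprobability measure on $\mathbb{F}_j$. Given
$(F_1, \ldots, F_l) \in \prod_{j=1}^{l} \mathbb{F}_j$, we define the subspace $F$ of
$\mathcal{F}_{l,\infty}$ as

\[ F = \left\{ f(y_1, \ldots, y_l) = f_1(y_1) + \ldots + f_l(y_l) \ \middle|\ f_j \in F_j \text{ for } 1 \le j \le l \right\} \tag{4.10} \]

and denote by $\mathbb{F}$ the set of all such $F$ when $(F_1, \ldots, F_l)$ varies among
$\prod_{j=1}^{l} \mathbb{F}_j$. Then, we define a subprobability measure $\gamma$ on $\mathbb{F}$
by setting

\[ \gamma(F) = \prod_{j=1}^{l} \gamma_j(F_j) \quad \text{or} \quad \Delta_{\gamma}(F) = \sum_{j=1}^{l} \Delta_{\gamma_j}(F_j), \]

when $F$ is given by (4.10). For such an $F$,
$d_\infty(g,F) \le \sum_{j=1}^{l} d_\infty(g_j,F_j)$, hence
$d_\infty^2(g,F) \le l \sum_{j=1}^{l} d_\infty^2(g_j,F_j)$ and
$\mathcal{D}(F) \le \sum_{j=1}^{l} \mathcal{D}(F_j)$. We deduce from Theorem 6
the following result.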

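Corollary 3 Assume that Theorem 1 holds with $q = 2$ and let $I \subset \mathbb{N}^*$ and for
$j \ge 1$, let $\mathbb{F}_j$ be a collection of finite dimensional linear subspaces of
$\mathcal{F}_{1,\infty}$ satisfying Assumption 1-i) and iii) for some subprobability
$\gamma_j$. There exists an estimator $\hat s$ such that

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \inf_{l \in I}\ \inf_{g \in \mathcal{F}^{\rm Add}_{l,c},\, u \in T_0^l} \left\{ d^2(s, g\circ u) + \sum_{j=1}^{l} \left( R_j(g,\mathbb{F}_j,\Delta_{\gamma_j}) + \tau k\, i(g,j,T_0) \right) \right\} \]

where

\[ R_j(g,\mathbb{F}_j,\Delta_{\gamma_j}) = \inf_{F_j \in \mathbb{F}_j} \left\{ d_\infty^2(g_j,F_j) + \tau\left[\mathcal{D}(F_j) + \Delta_{\gamma_j}(F_j)\right] \right\} \quad \text{for } 1 \le j \le l. \]

Moreover, if $s$ is of the form (4.9) for some $l \in I$ and functions
$g_j \in \mathcal{H}^{\alpha_j}([-1,1])$ satisfying $\|g_j\|_{\alpha_j,\infty} \le L_j$ for
$\alpha_j, L_j > 0$ and all $j = 1, \ldots, l$, one can choose the $\mathbb{F}_j$
and $\gamma_j$ in such a way that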



Es [d^{s,s)] < C 



(4.11) 



where C = log (r"^) V 1 V Vj=i log (-^i^^^ 
and ai, . . . ,ai only. 



and C is a constant depending on I 



For $j \ge 1$, $R_j = R_j(g,\mathbb{F}_j,\Delta_{\gamma_j})$ corresponds to the risk bound for
the estimation of the function $g_j$ alone when we use the family of models $\mathbb{F}_j$,
i.e. what we would get if we knew $\theta_j$ and that $g_i = 0$ for all $i \ne j$. In short,
$\sum_{j=1}^{l} R_j$ corresponds to the estimation rate of the additive function $g$. If each
$g_j$ belongs to some smoothness class, this rate is similar to that of a real-valued
function defined on the line with smoothness given by the worst component of $g$, as seen
in (4.11).

Proof of Corollary 3: The first part is a straightforward consequence of Theorem 6.
For the second part, fix $s = g \circ u$ and
$r = \lfloor \alpha_1 \vee \ldots \vee \alpha_l \rfloor$. Since the $g_j$ are
$(\alpha_j \wedge 1, L_j)$-Hölderian, $i(g,j,T_0) \le C'\mathcal{L}$ for some $C'$ depending on
the $\alpha_j$ only. By using Proposition 1, Lemma 1 and the collection
$\mathbb{F}_j = \mathbb{H}_{1,r}$ with $\gamma_j$ defined by (4.2), for all $j = 1, \ldots, l$,

\[ R_j \le C'' \inf_{D \ge 1} \left\{ L_j^2 D^{-2\alpha_j} + D\tau \right\} \le C'' \left( L_j^{2/(2\alpha_j+1)}\, \tau^{2\alpha_j/(2\alpha_j+1)} + \tau \right) \]

for some constants $C'$, $C''$ depending on the $\alpha_j$ only. Putting these bounds together,
we end up with an estimator $\hat s_r$ the risk of which is bounded from above by the
right-hand side of (4.11). We get the result for all values of $r$ by using Theorem 3 and
arguing as in the proof of Theorem 4. □



4.4.3 Artificial neural networks

In this section, we consider approximations of $s$ on $E = [-1,1]^k$ by functions of the
form

\[ \bar s(x) = \sum_{j=1}^{l} R_j\, \psi\left(\langle a_j, x\rangle + b_j\right) \quad \text{with } |b_j| + |a_j|_1 \le 2^r, \tag{4.12} \]

for given values of $(l,r) \in I = (\mathbb{N}^*)^2$. Here, $R = (R_1, \ldots, R_l) \in \mathbb{R}^l$,
$a_j \in \mathbb{R}^k$, $b_j \in \mathbb{R}$ for $j = 1, \ldots, l$ and $\psi$ is a given uniformly
continuous function on $\mathbb{R}$ with modulus of continuity $w_\psi$. We denote by
$\mathcal{S}_{l,r}$ the set of all functions $\bar s$ of the form (4.12).

Let us now set $\psi_r(y) = \psi(2^r y)$ for $y \in \mathbb{R}$ and, for $x \in E$,
$u_j(x) = 2^{-r}\left(\langle a_j, x\rangle + b_j\right)$, so that $u_j \in \mathcal{T}$ belongs to the
$(k+1)$-dimensional space of functions of the form $x \mapsto \langle a, x\rangle + b$. We can
then rewrite $\bar s$ in the form $g \circ u$ with
$g(y_1, \ldots, y_l) = \sum_{j=1}^{l} R_j \psi_r(y_j)$. Since $g$ belongs to the $l$-dimensional
linear space $F$ spanned by the functions $\psi_r(y_j)$, we may set $\mathbb{F} = \{F\}$,
$\Delta_\gamma(F) = 0$ and apply Theorem 6. With $w_{g,j}(y) = |R_j|\, w_\psi(2^r y)$,
(4.6) becomes,

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s_{l,r})\right] \le d^2(s,\bar s) + \tau(k+1) \sum_{j=1}^{l} \inf\left\{ i \in \mathbb{N}^* \ \middle|\ l R_j^2\, w_\psi^2\left(2^r e^{-i}\right) \le (k+1)\tau i \right\}. \]
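The rewriting of a network of the form (4.12) as a composite function can be checked numerically. The sketch below is illustrative only: the activation `psi` (here `tanh`), the random draw of the weights and the helper names `network` and `as_composite` are assumptions made for the example, not objects from the paper. It verifies that $\bar s(x) = g(u(x))$ with $\psi_r(y) = \psi(2^r y)$ and $u_j(x) = 2^{-r}(\langle a_j,x\rangle + b_j)$, and that each $u_j(x)$ stays in $[-1,1]$.

```python
import math
import random

def psi(z):
    return math.tanh(z)  # hypothetical choice of the activation psi

def network(R, a, b):
    """s_bar(x) = sum_j R_j * psi(<a_j, x> + b_j), as in (4.12)."""
    def s_bar(x):
        return sum(Rj * psi(sum(ai * xi for ai, xi in zip(aj, x)) + bj)
                   for Rj, aj, bj in zip(R, a, b))
    return s_bar

def as_composite(R, a, b, r):
    """Return (g, u) with psi_r(y) = psi(2^r y) and u_j(x) = 2^{-r}(<a_j, x> + b_j)."""
    scale = 2.0 ** r
    def u(x):
        return [(sum(ai * xi for ai, xi in zip(aj, x)) + bj) / scale
                for aj, bj in zip(a, b)]
    def g(y):
        return sum(Rj * psi(scale * yj) for Rj, yj in zip(R, y))
    return g, u

rng = random.Random(1)
k, l, r = 4, 3, 2
a, b = [], []
for _ in range(l):
    # Draw (a_j, b_j) with |b_j| + |a_j|_1 <= 2^r by rescaling a raw vector.
    raw = [rng.uniform(-1, 1) for _ in range(k + 1)]
    factor = (2.0 ** r) * rng.random() / sum(abs(v) for v in raw)
    a.append([factor * v for v in raw[:k]])
    b.append(factor * raw[k])
R = [rng.uniform(-2, 2) for _ in range(l)]

s_bar = network(R, a, b)
g, u = as_composite(R, a, b, r)
x = [rng.uniform(-1, 1) for _ in range(k)]
gap = abs(s_bar(x) - g(u(x)))          # difference between the two writings
max_u = max(abs(v) for v in u(x))      # should not exceed 1
```

The constraint $|b_j| + |a_j|_1 \le 2^r$ is exactly what keeps $u_j$ inside $[-1,1]$ on $E$; the scale $2^r$ is moved into the outer function through $\psi_r$.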
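If $w_\psi(y) \le L y^{\alpha}$ for some $L > 0$, $0 < \alpha \le 1$ and all
$y \in \mathbb{R}_+$, then, according to (3.4),

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s_{l,r})\right] \le d^2(s,\bar s) + k\tau \sum_{j=1}^{l} \left( \left[\alpha^{-1} \log\left( l R_j^2 L^2 2^{2r\alpha} [k\tau]^{-1} \right)\right] \vee 1 \right) \]
\[ \le d^2(s,\bar s) + lk\tau \left( r \log 4 + \alpha^{-1} \log_+\left( l |R|_1^2 L^2 [k\tau]^{-1} \right) \right). \tag{4.13} \]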
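To see how the index defined through the infimum behaves under a Hölder condition, one can compute it directly. The sketch below is an illustration under assumptions: the function names and the numerical values are hypothetical, and the logarithmic bound used is an elementary one derived here, not necessarily the exact form of (3.4). Writing $A = l R_j^2 L^2 2^{2r\alpha}$, the defining inequality reads $A e^{-2\alpha i} \le (k+1)\tau i$, which holds for every integer $i \ge 1$ such that $A e^{-2\alpha i} \le (k+1)\tau$.

```python
import math

def index_i(l, R_j, L, alpha, r, k, tau):
    """Smallest i in N* with l * R_j^2 * w_psi(2^r e^{-i})^2 <= (k + 1) * tau * i,
    for the Hoelder modulus w_psi(y) = L * y^alpha."""
    i = 1
    while l * (R_j * L * (2.0 ** r * math.exp(-i)) ** alpha) ** 2 > (k + 1) * tau * i:
        i += 1
    return i

def log_bound(l, R_j, L, alpha, r, k, tau):
    """Elementary upper bound on index_i: the first integer i >= 1 for which
    A * exp(-2 * alpha * i) <= (k + 1) * tau, with A = l R_j^2 L^2 2^{2 r alpha}."""
    A = l * (R_j * L) ** 2 * 2.0 ** (2 * r * alpha)
    return max(1, math.ceil(math.log(A / ((k + 1) * tau)) / (2 * alpha)))

# Hypothetical parameter values, chosen only for illustration.
params = dict(l=5, R_j=3.0, L=2.0, alpha=0.5, r=4, k=20, tau=1e-4)
i_star = index_i(**params)
i_bound = log_bound(**params)
```

The index grows only logarithmically in $\tau^{-1}$, which is the source of the logarithmic factors in the risk bounds of this section.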

These bounds being valid for all $(l,r) \in I$ and $\bar s \in \mathcal{S}_{l,r}$, we may apply
Theorem 3 to the family of all estimators $\hat s_{l,r}$, $(l,r) \in I$, with $\nu$ given by
$\nu(l,r) = e^{-l-r}$. We then get the following result.

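Theorem 7 Assume that Theorem 1 holds with $q = 2$ and that $\psi$ is a continuous
function with modulus of continuity $w_\psi(y)$ bounded by $L y^{\alpha}$ for some $L > 0$,
$0 < \alpha \le 1$ and all $y \in \mathbb{R}_+$. Then one can build an estimator
$\hat s = \hat s(X)$ such that

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \inf_{(l,r) \in I}\ \inf_{\bar s \in \mathcal{S}_{l,r}} \left\{ d^2(s,\bar s) + lk\tau r \left[ 1 + (r\alpha)^{-1} \log_+\left( \frac{l |R|_1^2 L^2}{k\tau} \right) \right] \right\}. \tag{4.14} \]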



Approximation by functions of the form (4.12). Various authors have provided
conditions on the function $s$ so that it can be approximated within $\eta$ by functions
$\bar s$ of the form (4.12) for a given function $\psi$. An extensive list of authors and
results is provided in Section 4.2.2 of Barron, Birgé and Massart (1999) and some
proofs are provided in Section 8.2 of that paper. The starting point of such
approximations is the assumed existence of a Fourier representation of $s$ of the form

\[ s(x) = K_s \int_{\mathbb{R}^k} \cos\left(\langle a, x\rangle + \delta(a)\right) dF_s(a), \quad K_s \in \mathbb{R}, \quad |\delta(a)| \le \pi, \]

for some probability measure $F_s$ on $\mathbb{R}^k$. To each given function $\psi$ that can be
used for the approximation of $s$ is associated a positive number $\beta = \beta(\psi) > 0$ and
one has to assume that

\[ \int |a|^{\beta}\, dF_s(a) < +\infty, \tag{4.15} \]

in order to control the approximation of $s$ by functions of the form (4.12). A careful
inspection of the proof of Proposition 6 in Barron, Birgé and Massart (1999) shows
that, when (4.15) holds, one can derive the following approximation result for $s$.
There exist constants $r_\psi \ge 1$, $\gamma_\psi > 0$ and $C_\psi > 0$ depending on $\psi$ only, a
number $R_{s,\beta} \ge 1$ depending on $c_{s,\beta}$ only and some $\bar s \in \mathcal{S}_{l,r}$ with
$|R|_1 \le R_{s,\beta}$ such that

\[ d(s,\bar s) \le K_s C_\psi \left[ 2^{-r\gamma_\psi} + R_{s,\beta}\, l^{-1/2} \right] \quad \text{for } r \ge r_\psi. \tag{4.16} \]
Putting this bound into (4.14) and omitting the various indices for simplicity, we get
a risk bound of the form

\[ \mathcal{R}(l,r) = C K^2 \left[ 2^{-2r\gamma} + R^2 l^{-1} + K^{-2} lk\tau r \left[ 1 + (r\alpha)^{-1} \log_+\left( l R^2 L^2 [k\tau]^{-1} \right) \right] \right], \]

to be optimized with respect to $l \ge 1$ and $r \ge r_\psi$. We shall actually perform the
optimization with respect to the first three terms, omitting the logarithmic one.

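Let us first note that, if $RK \le \sqrt{r_\psi k\tau}$, one should set $r = r_\psi$ and
$l = 1$, which leads to

\[ \mathcal{R}(1, r_\psi) \le C k\tau r_\psi \left[ 1 + (r_\psi\alpha)^{-1} \log_+\left( R^2 L^2 [k\tau]^{-1} \right) \right]. \]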



Otherwise y^r^kr < RK and we set 

r = r* = inf |r > 2-^'"^ < {R/K)V^^ and I = I* 
If I* > 1, then RK{r*kTy^/'^ < I* < 2RK{r*kTy^/'^ hence 



RK 

\fr*kT 



7^(^,r*) < crkV^^^ 



2R^L'^K 
(A;r)3/2^ 



(4.17) 



If I* = 1, then R^ < K '^r*kT and s/r^^pkr < RK < ^r*kT, hence r* > and 
r* — 1 > r*/2. Then, from the definition of r* , 

RK~^ {r* /2)kT < RK~^^{r* - l)kT < 2-2('^*-ih < 2-27^ 

hence Vr*kT < (K/ii)2~27+(i/2) < ^nd (4.17) still holds. To conclude, we 

observe that either — 27r^log2 < log (^RK~^ y^r^kr) and r* = or the solution zq 
of the equation 



2z7 log 2 = log K/ 



RVkT 



(1/2) log z 



satisfies < zq < r* . Since logzo < ■Zo/e, it follows that 



/(27log2 + e-i) 



r* > log ( K/ 
and, by monotonicity, that 



RVkT 



<£=(27log2 + e-i)log+ 



2R^L'^K 



{kTf/'^. 

where £ is a bounded function of kr. One can also check that 

log (A7 



log 



K 



RVkT 



r < r 



27 log 2 



and (4.17) finally leads, when r* > r^, to 



n{l*,r*) < CRK kr 



log (i^/ [i^^Av^]) 

27 log 2 



\ 1/2 

[l + a-^£] 



(4.18) 



In the asymptotic situation where $\tau$ converges to zero, (4.18) prevails and we get a
risk bound of order $\left[-k\tau \log(k\tau)\right]^{1/2}$.
4.5 Estimation of a regression function and PCA

We consider here the regression framework

\[ Y_i = s(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \]

where the $X_i$ are random variables with values in some known compact subset $K$
of $\mathbb{R}^k$ (with $k > 1$ to avoid trivialities), the $\varepsilon_i$ are i.i.d. centered
random variables of common variance 1 for simplicity and $s$ is an unknown function from
$\mathbb{R}^k$ to $\mathbb{R}$. By a proper origin and scale change on the $X_i$, mapping $K$ into
the unit ball $B_k$ of $\mathbb{R}^k$, one may assume that the $X_i$ belong to $B_k$, hence that
$E = B_k$, which we shall do from now on. We also assume that the $X_i$ are either i.i.d.
with common distribution $\mu$ on $E$ (random design) or deterministic ($X_i = x_i$, fixed
design), in which case $\mu = n^{-1} \sum_{i=1}^{n} \delta_{x_i}$, where $\delta_x$ denotes the
Dirac measure at $x$. In both cases, we choose for $d$ the distance in
$\mathbb{L}_2(E,\mu)$. As already mentioned in Section 2.3, Theorem 1 with
$\tau = n^{-1}$ applies to this framework, at least in the two cases when the design is fixed
and the errors Gaussian (or subgaussian) or when the design is random and the $Y_i$
are bounded, say with values in $[-1,1]$.



4.5.1 Introducing PCA 

Our aim is to estimate $s$ from the observation of the pairs $(X_i, Y_i)$ for
$i = 1, \ldots, n$, assuming that $s$ belongs to some smoothness class. More precisely, given
$A \subset \mathbb{R}^k$ and some concave modulus of continuity $w$ on $\mathbb{R}_+$, we define
$\mathcal{H}_w(A)$ to be the class of functions $h$ on $A$ such that

\[ |h(x) - h(y)| \le w(|x - y|) \quad \text{for all } x, y \in A. \]

Here we assume that $s$ is defined on $B_k$ and belongs to $\mathcal{H}_w(B_k)$, in which case it
can be extended to an element of $\mathcal{H}_w(\mathbb{R}^k)$, which we shall use when needed.
Typically, if $w(z) = L z^{\alpha}$ with $\alpha \in (0,1]$ and the $X_i$ are i.i.d. with uniform
distribution $\mu$ on $E$, the minimax risk bound over $\mathcal{H}_w(B_k)$ with respect to the
$\mathbb{L}_2(E,\mu)$-loss is $C' L^{2k/(k+2\alpha)} n^{-2\alpha/(k+2\alpha)}$ (where $C'$ depends on
$k$ and the distribution of the $\varepsilon_i$). It can be quite slow if $k$ is large (see
Stone (1982)), although no improvement is possible from the minimax point of view if the
distribution of the $X_i$ is uniform on $B_k$. Nevertheless, if the data $X_i$ were known to
belong to an affine subspace $V$ of $\mathbb{R}^k$ the dimension $l$ of which is small as
compared to $k$, so that $\mu(V) = 1$, estimating the function $s$ with
$\mathbb{L}_2(E,\mu)$-loss would amount to estimating $s \circ \Pi_V$ (where $\Pi_V$ denotes
the orthogonal projector onto $V$) and one would get the much better rate
$n^{-2\alpha/(l+2\alpha)}$ with respect to $n$ for the quadratic risk. Such a situation is
seldom encountered in practice but we may assume that it is approximately satisfied for
some well-chosen $V$. It therefore becomes natural to look for an affine space $V$ with
dimension $l \le k$ such that $s$ and $s \circ \Pi_V$ are close with respect to the
$\mathbb{L}_2(E,\mu)$-distance. For $s \in \mathcal{H}_w(\mathbb{R}^k)$, it follows from Lemma 3
below that,

\[ \int_E |s(x) - s \circ \Pi_V(x)|^2\, d\mu(x) \le \int_E w^2\left(|x - \Pi_V x|\right) d\mu(x) \le 2 w^2\left( \left[ \int_E |x - \Pi_V x|^2\, d\mu(x) \right]^{1/2} \right) \]
and minimizing the right-hand side amounts to finding an affine space $V$ with
dimension $l$ for which $\int_E |x - \Pi_V x|^2\, d\mu(x)$ is minimum. This way of reducing
the dimension is usually known as PCA (for Principal Components Analysis). When the $X_i$
are deterministic and $\mu = n^{-1} \sum_{i=1}^{n} \delta_{x_i}$, the solution to this
minimization problem is given by the affine space $V_l = a + W_l$ where the origin
$a = \overline{x}_n = n^{-1} \sum_{i=1}^{n} x_i$ and $W_l$ is the linear space generated by the
eigenvectors associated to the $l$ largest eigenvalues (counted with their multiplicity)
of $XX^*$ (where $X$ is the $k \times n$ matrix with columns $x_i - \overline{x}_n$ and $X^*$
is the transpose of $X$). In the general case, it suffices to set $a = \int_E x\, d\mu$ (so
that $a \in E$) and replace $XX^*$ by the matrix

\[ \Gamma = \int_E (x - a)(x - a)^*\, d\mu(x). \]

If $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_k \ge 0$ are the eigenvalues of $\Gamma$ in
nonincreasing order, then

\[ \inf_{\{V \,|\, \dim(V) = l\}} \int_E |x - \Pi_V x|^2\, d\mu(x) = \sum_{j=l+1}^{k} \lambda_j \tag{4.19} \]

(with the convention $\sum_{\varnothing} = 0$) and therefore

\[ \inf_{\{V \,|\, \dim(V) = l\}} \|s - s \circ \Pi_V\|_2^2 \le \|s - s \circ \Pi_{V_l}\|_2^2 \le 2 w^2\left( \sqrt{\sum_{j=l+1}^{k} \lambda_j} \right). \tag{4.20} \]
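The identity between the PCA residual and the sum of the discarded eigenvalues can be checked on a toy data set. The sketch below takes $k = 2$ and $l = 1$ so that the eigen-decomposition has a closed form and no numerical library is needed; the simulated design, the helper names and the normalization $\Gamma = n^{-1}\sum_i (x_i - a)(x_i - a)^*$ (the empirical measure) are assumptions made for the illustration.

```python
import math
import random

def empirical_covariance(points):
    """Mean a and entries (gxx, gxy, gyy) of Gamma = (1/n) sum (x - a)(x - a)^T."""
    n = len(points)
    ax = sum(p[0] for p in points) / n
    ay = sum(p[1] for p in points) / n
    gxx = sum((p[0] - ax) ** 2 for p in points) / n
    gyy = sum((p[1] - ay) ** 2 for p in points) / n
    gxy = sum((p[0] - ax) * (p[1] - ay) for p in points) / n
    return (ax, ay), (gxx, gxy, gyy)

def eig_sym_2x2(gxx, gxy, gyy):
    """Eigenvalues lam1 >= lam2 of [[gxx, gxy], [gxy, gyy]] and a unit eigenvector for lam1."""
    half_trace = 0.5 * (gxx + gyy)
    radius = math.hypot(0.5 * (gxx - gyy), gxy)
    angle = 0.5 * math.atan2(2.0 * gxy, gxx - gyy)
    return half_trace + radius, half_trace - radius, (math.cos(angle), math.sin(angle))

def mean_sq_residual(points, a, v):
    """Average squared distance to the affine line a + span(v), with |v| = 1."""
    total = 0.0
    for p in points:
        dx, dy = p[0] - a[0], p[1] - a[1]
        t = dx * v[0] + dy * v[1]
        total += (dx - t * v[0]) ** 2 + (dy - t * v[1]) ** 2
    return total / len(points)

rng = random.Random(3)
points = []
for _ in range(500):
    t = rng.uniform(-0.8, 0.8)   # data concentrated near a line, plus small noise
    points.append((0.9 * t + rng.gauss(0, 0.03), 0.4 * t + rng.gauss(0, 0.03)))

a, (gxx, gxy, gyy) = empirical_covariance(points)
lam1, lam2, v1 = eig_sym_2x2(gxx, gxy, gyy)
pca_residual = mean_sq_residual(points, a, v1)
other = [mean_sq_residual(points, a, (math.cos(t), math.sin(t)))
         for t in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
```

Here `pca_residual` plays the role of the left-hand side of (4.19) for $l = 1$ and `lam2` that of its right-hand side; any other direction through `a` gives a residual at least as large.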



4.5.2 PCA and composite functions 

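In order to put the problem at hand into our framework, we have to express
$s \circ \Pi_{V_l}$ in the form $g \circ u$. To do so we consider an orthonormal basis
$\mathbf{u}_1, \ldots, \mathbf{u}_k$ of eigenvectors of $XX^*$ or $\Gamma$ (according to the
situation) corresponding to the ordered eigenvalues
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_k \ge 0$. For a given value of $l \le k$ we
denote by $a^{\perp}$ the component of $a$ which is orthogonal to the linear span $W_l$ of
$\mathbf{u}_1, \ldots, \mathbf{u}_l$ and for $x \in B_k$, we define
$u_j(x) = \langle x, \mathbf{u}_j\rangle$ for $j = 1, \ldots, l$. This results in an element
$u = (u_1, \ldots, u_l)$ of $\mathcal{T}^l$ and
$a^{\perp} + \sum_{j=1}^{l} u_j(x)\mathbf{u}_j = \Pi_{V_l}(x)$ is the projection of $x$ onto the
affine space $V_l = a + W_l$. Setting

\[ g(z) = s\left( a^{\perp} + \sum_{j=1}^{l} z_j \mathbf{u}_j \right) \quad \text{for } z \in [-1,1]^l \]

leads to a function $g \circ u$ with $u \in \mathcal{T}^l$ and $g \in \mathcal{F}_{l,c}$ which
coincides with $s \circ \Pi_{V_l}$ on $B_k$ as required. Consequently, the right-hand side of
(4.20) provides a bound on the distance between $s$ and $g \circ u$. Moreover, since
$s \in \mathcal{H}_w(\mathbb{R}^k)$,

\[ |g(z) - g(z')| \le w\left( \left| \sum_{j=1}^{l} z_j \mathbf{u}_j - \sum_{j=1}^{l} z'_j \mathbf{u}_j \right| \right) = w\left( \left| \sum_{j=1}^{l} (z_j - z'_j)\mathbf{u}_j \right| \right) = w(|z - z'|), \tag{4.21} \]

so that we may set $w_{g,j} = w$ for all $j \in \{1, \ldots, l\}$.

In the following sections we shall use this preliminary result in order to establish
risk bounds for estimators $\hat s_l$ of $s$, distinguishing between the two situations
where $\mu$ is known and $\mu$ is unknown.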
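4.5.3 Case of a known $\mu$

For $D \in \mathbb{N}^*$, we consider the partition $\mathcal{P}_{l,D}$ of $[-1,1]^l$ into cubes
with edge length $2/D$ and denote by $F_{l,D}$ the linear space of functions which are
piecewise constant on each element of $\mathcal{P}_{l,D}$ so that $\mathcal{D}(F_{l,D}) = D^l$
for all $D \in \mathbb{N}^*$. This leads to the family
$\mathbb{F} = \{F_{l,D},\ D \in \mathbb{N}^*\}$ and we set $\gamma(F_{l,D}) = e^{-D}$ for all
$D \ge 1$. We define $u_j$ as in the previous section and take for $\mathbb{T}_j$ the family
reduced to the single model $T_j = \{u_j\}$ for $j = 1, \ldots, l$. Then $\mathcal{D}(T_j) = 0$
for all $j$ and we take for $\lambda_j$ the Dirac measure on $T_j$. This leads to a set
$\mathbb{S}$ which satisfies Assumption 1 and we may therefore apply Theorem 2 which leads
to an estimator $\hat s_l$ with a risk bounded by

\[ C\,\mathbb{E}_s\left[\|s - \hat s_l\|_2^2\right] \le d^2(s, g \circ u) + \inf_{D \ge 1} \left\{ d_\infty^2(g, F_{l,D}) + \frac{D^l + D}{n} \right\}. \]

Since $s \circ \Pi_{V_l}$ and $g \circ u$ coincide on $B_k$, it follows from (4.20) that

\[ \|s - g \circ u\|_2^2 = \|s - s \circ \Pi_{V_l}\|_2^2 \le 2 w^2\left( \sqrt{\sum_{j=l+1}^{k} \lambda_j} \right). \]

Moreover, for all cubes $I \in \mathcal{P}_{l,D}$ and $x \in I$, the Euclidean distance between
$x$ and the center of $I$ is at most $\sqrt{l} D^{-1}$, hence by (4.21),
$d_\infty(g, F_{l,D}) \le w\left(\sqrt{l} D^{-1}\right)$ for all $D \ge 1$. Putting these
inequalities together we see that the risk of $\hat s_l$ is bounded by

\[ C\,\mathbb{E}_s\left[\|s - \hat s_l\|_2^2\right] \le w^2\left( \sqrt{\sum_{j=l+1}^{k} \lambda_j} \right) + \inf_{D \ge 1} \left\{ w^2\left(\sqrt{l} D^{-1}\right) + \frac{D^l}{n} \right\}. \tag{4.22} \]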
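Two ingredients of the bound can be illustrated numerically: the approximation error of piecewise constant functions, and the trade-off in $D$ between $w^2(\sqrt{l}/D)$ and $D^l/n$. The sketch below is a minimal illustration under assumptions: the test function `g`, the grid-based evaluation of the sup-norm, the value of `n` and the search range `D_max` are all hypothetical choices, and the modulus is taken to be $w(z) = L z^{\alpha}$.

```python
def w(z, L=1.0, alpha=0.5):
    """Hoelder modulus of continuity w(z) = L * z^alpha."""
    return L * z ** alpha

def sup_error_piecewise_constant(g, D, grid=200):
    """Sup-distance (evaluated on a fine grid) between g on [-1, 1] and the function
    equal to g(center) on each of the D intervals of length 2/D (case l = 1)."""
    worst = 0.0
    for c in range(D):
        left = -1.0 + 2.0 * c / D
        center = left + 1.0 / D
        for m in range(grid + 1):
            y = left + (2.0 / D) * m / grid
            worst = max(worst, abs(g(y) - g(center)))
    return worst

def objective(D, n, l, L=1.0, alpha=0.5):
    """Approximation term plus dimension term: w(sqrt(l)/D)^2 + D^l / n."""
    return w(l ** 0.5 / D, L, alpha) ** 2 + D ** l / n

def best_D(n, l, L=1.0, alpha=0.5, D_max=1000):
    """Minimize the objective over D = 1, ..., D_max by exhaustive search."""
    return min(range(1, D_max + 1), key=lambda D: objective(D, n, l, L, alpha))

g = lambda y: abs(y) ** 0.5          # satisfies |g(x) - g(y)| <= |x - y|^(1/2)
errors = {D: sup_error_piecewise_constant(g, D) for D in (1, 2, 5, 10)}
D_star = best_D(n=10_000, l=2)
```

For $l = 1$ the distance from a point to the center of its interval is at most $1/D$, so each entry of `errors` is bounded by `w(1 / D)`; the exhaustive search in `best_D` mimics the infimum over $D \ge 1$ in the risk bound.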



4.5.4 Case of an unknown $\mu$

When $\mu$ corresponds to an unknown distribution of the $X_i$, the matrix $\Gamma$ is
unknown, its eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_k$ and the vector $a$ as well and
therefore also the elements $u_1, \ldots, u_l$ of $\mathcal{T}$. In order to cope with this
problem, we have to approximate the unknown $u_j$ which requires to modify the definition
of $T_j$ given in the previous section, keeping all other things unchanged. For each
$v \in \mathbb{R}^k$ with $|v| \le 1$, we denote by $t_v$ the linear map, element of
$\mathcal{T}$, given by $t_v(x) = \langle x, v\rangle$. Denoting by $\mathcal{B}_k$ the unit sphere
in $\mathbb{R}^k$ we then set, for all $j$, $T_j = T = \{t_v,\ v \in \mathcal{B}_k\}$ which is a
subset of a $k$-dimensional linear subspace of $\mathbb{L}_2(\mu)$. It follows that
Assumption 1 remains satisfied but now with $\mathcal{D}(T_j) = k$. Since $u_j \in T_j$ for
all $j$, an application of Theorem 2 leads to



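\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \frac{k}{n} \sum_{j=1}^{l} i(g,j,T) + d^2(s, g \circ u) + \inf_{D \ge 1} \left\{ d_\infty^2(g, F_{l,D}) + \frac{D^l + D}{n} \right\} \]

where $i(g,j,T)$ is given by (3.3). Since, by (4.21), $w_{g,j} = w$ for all
$j \in \{1, \ldots, l\}$,

\[ i(g,j,T) = i = \inf\left\{ i \in \mathbb{N}^* \ \middle|\ l w^2\left(e^{-i}\right) \le \frac{ik}{n} \right\}. \]

Arguing as in the case of a known $\mu$, we get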



CE, 



^ l|2 



< 



kli 



+ w^ 



n 




+ 



n 
Let i£) 



i>iz) + l>2 and 



log [D/y/l) . If i < io, then kli/n < klD/n since in ^ D. Otherwise, 



n 2n 

which shows that kli/n < 2Z^w^ ) + klD/n. Finally 



C7E, 



< w^ 



< w^ 



w 



2L>' 



n 



which is, up to constants, the same as (4.22). We do not know whether the multi- 
plicative factor Ik arising here and missing in (4.22) can be improved or not. 



4.5.5 Varying $l$

The previous bounds are valid for all values of $l \in I = \{1, \ldots, k\}$ but we do not
know which value of $l$ will lead to the best estimator. We may therefore apply Theorem 3
with $\nu(l) = l^{-2}/2$ for $l \in I$ which leads to the following risk bound for the new
estimator $\hat s$ in the case of a known $\mu$:



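\[ C\,\mathbb{E}_s\left[\|s - \hat s\|_2^2\right] \le \inf_{l \in \{1,\ldots,k\}}\ \inf_{D \ge 1} \left\{ w^2\left( \sqrt{\sum_{j=l+1}^{k} \lambda_j} \right) + w^2\left(\sqrt{l} D^{-1}\right) + \frac{D^l + \log l}{n} \right\}. \]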

Apart from multiplicative constants depending only on $k$, the same result holds when
$\mu$ is unknown. If $w(z) = L z^{\alpha}$ for some $L > 0$ and $\alpha \in (0,1]$, we get,
since $\sum_{j=l+1}^{k} \lambda_j \le (k - l)\lambda_{l+1}$ (with the convention
$\lambda_{k+1} = 0$),



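\[ C\,\mathbb{E}_s\left[\|s - \hat s\|_2^2\right] \le \inf_{l \in \{1,\ldots,k\}}\ \inf_{D \ge 1} \left\{ L^2\left[(k - l)\lambda_{l+1}\right]^{\alpha} + L^2 l^{\alpha} D^{-2\alpha} + \frac{D^l + \log l}{n} \right\}. \]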

Assuming that n > L ^ to avoid trivialities and choosing D = (nL^/°) ^^^'"'^^"^ 
finally get 



, we 



CE, 



s — s 



f I , T2l/{l+2a) ) 

< inf \L^[{k-l)Xi+ir + ^+ „ . }. 

le{l,...,k} / T- J ^ ^2a/{l+2a) 



For $l = k$, we recover (up to constants) the minimax risk bound over
$\mathcal{H}_w(B_k)$, namely $C'(k) L^{2k/(k+2\alpha)} n^{-2\alpha/(k+2\alpha)}$. Therefore our
procedure can only improve the risk as compared to the minimax approach.
4.6 Introducing parametric models 



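In this section, we approximate $s$ by functions of the form $\bar s = g \circ u$ where $g$
belongs to $\mathcal{F}_{l,c}$ and the components $u_j$ of $u$ to parametric models
$T_j = \{u_j(\theta, .),\ \theta \in \Theta_j\} \subset \mathcal{T}$ indexed by subsets
$\Theta_j$ of $\mathbb{R}^{k_j}$ with $k_j \ge 1$. Besides, we assume that the following
holds.

Assumption 3 For each $j = 1, \ldots, l$, $\Theta_j \subset B_{k_j}(0, M_j)$ for some positive
number $M_j$ and the mapping $\theta \mapsto u_j(\theta, .)$ from $\Theta_j$ to
$(\mathcal{T}, d)$ is $(\beta_j, R_j)$-Hölderian for $\beta_j \in (0,1]$ and $R_j > 0$ which
means that

\[ d\left(u_j(\theta, .), u_j(\theta', .)\right) \le R_j |\theta - \theta'|^{\beta_j} \quad \text{for all } \theta, \theta' \in \Theta_j. \tag{4.23} \]

Under such an assumption, the following result holds.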



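Theorem 8 Assume that Theorem 1 holds and let $l \ge 1$, $T_1, \ldots, T_l$ be parametric
sets satisfying Assumption 3, $\mathbb{F}$ be a collection of models satisfying
Assumption 1-i) and $\gamma$ be a subprobability on $\mathbb{F}$. There exists an estimator
$\hat s$ such that

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le d^2(s, g \circ u) + \inf_{F \in \mathbb{F}} \left[ d_\infty^2(g,F) + \tau\left(\Delta_\gamma(F) + \mathcal{D}(F)\right) \right] \]
\[ \qquad + \sum_{j=1}^{l} \left[ \tau k_j \log\left(1 + 2 M_j R_j^{1/\beta_j}\right) + \inf_{i \ge 1} \left\{ l w_{g,j}^2\left(e^{-i}\right) + i\tau\left[1 + k_j \beta_j^{-1}\right] \right\} \right] \]

for all $g \in \mathcal{F}_{l,c}$ and $u_j \in T_j$, $j = 1, \ldots, l$.

In particular, for all $(\alpha, L)$-Hölderian functions $g$ with $\alpha \in (0,1]^l$ and
$L \in (\mathbb{R}_+^*)^l$,

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le d^2(s, g \circ u) + \inf_{F \in \mathbb{F}} \left[ d_\infty^2(g,F) + \tau\left(\Delta_\gamma(F) + \mathcal{D}(F)\right) \right] \]
\[ \qquad + \tau \sum_{j=1}^{l} \left[ k_j \log\left(1 + 2 M_j R_j^{1/\beta_j}\right) + (\mathcal{L}_j \vee 1)\left(1 + k_j \beta_j^{-1}\right) \right] \tag{4.24} \]

where

\[ \mathcal{L}_j = \frac{1}{2\alpha_j} \log\left( \frac{l L_j^2}{\tau\left[1 + k_j \beta_j^{-1}\right]} \right) \quad \text{for } j = 1, \ldots, l. \tag{4.25} \]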



Although this theorem is stated for a given value of $l$, we may, arguing as before, let
$l$ vary and design a new estimator which achieves the same risk bounds (apart from the
constant $C$) whatever the value of $l$.

As usual, the quantity
$\inf_{F \in \mathbb{F}} \left[ d_\infty^2(g,F) + \tau\left(\Delta_\gamma(F) + \mathcal{D}(F)\right) \right]$
corresponds to the estimation rate for the function $g$ alone by using the collection
$\mathbb{F}$. In particular, if $g \in \mathcal{H}^{\alpha}([-1,1]^l)$ with
$\alpha \in (\mathbb{R}_+^*)^l$, this bound is of order
$\tau^{2\overline{\alpha}/(2\overline{\alpha}+l)}$ as $\tau$ tends to 0 for a classical choice
of $\mathbb{F}$ (see Section 4.1). Since for all $j$, $g$ is also
$(\alpha_j \wedge 1)$-Hölderian as a function of $x_j$ alone, the last term in the
right-hand side of (4.24), which is of order $-\tau \log \tau$, becomes negligible as
compared to $\tau^{2\overline{\alpha}/(2\overline{\alpha}+l)}$; therefore, when $s$ is really of
the form $g \circ u$ with $g \in \mathcal{H}^{\alpha}([-1,1]^l)$ the rate we get for estimating
$s$ is the same as that for estimating $g$.
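Proof of Theorem 8: For $\eta > 0$ and $j = 1, \ldots, l$, let $\Theta_j[\eta]$ be a maximal
subset of $\Theta_j$ satisfying $|t - t'| > \eta$ for all distinct $t, t'$ in
$\Theta_j[\eta]$. Since $\Theta_j$ is a subset of the Euclidean ball in $\mathbb{R}^{k_j}$
centered at 0 with radius $M_j$, it follows from classical entropy computations (see
Lemma 4 in Birgé (2006)) that
$\log |\Theta_j[\eta]| \le k_j \log(1 + 2 M_j \eta^{-1})$. For all $i \in \mathbb{N}^*$, let
$T_{j,i}$ be the image of $\Theta_{j,i} = \Theta_j\left[(R_j e^{i})^{-1/\beta_j}\right]$ by the
mapping $\theta \mapsto u_j(\theta, .)$. Clearly,

\[ \log |T_{j,i}| \le \log |\Theta_{j,i}| \le k_j \log\left(1 + 2 M_j R_j^{1/\beta_j} e^{i/\beta_j}\right) \le k_j \left[ \log\left(1 + 2 M_j R_j^{1/\beta_j}\right) + i\beta_j^{-1} \right] \]

and because of the maximality of $\Theta_{j,i}$ and (4.23), for all $\theta \in \Theta_j$ there
exists $\overline{\theta} \in \Theta_{j,i}$ such that
$d\left(u_j(\theta, .), u_j(\overline{\theta}, .)\right) \le R_j |\theta - \overline{\theta}|^{\beta_j} \le e^{-i}$
so that $T_{j,i}$ is an $e^{-i}$-net for $T_j$. For $j = 1, \ldots, l$, we set
$\mathbb{T}_j = \bigcup_{i \ge 1} T_{j,i}$ so that the models in $\mathbb{T}_j$ are merely the
elements of the sets $T_{j,i}$. For a model $T$ that belongs to
$T_{j,i} \setminus \bigcup_{1 \le i' < i} T_{j,i'}$ (with the convention
$\bigcup_{\varnothing} = \varnothing$) we set

\[ \Delta_{\lambda_j}(T) = \log |T_{j,i}| + i \le k_j \log\left(1 + 2 M_j R_j^{1/\beta_j}\right) + i\left(1 + k_j \beta_j^{-1}\right) \]

which defines a measure $\lambda_j$ on $\mathbb{T}_j$ satisfying

\[ \sum_{T \in \mathbb{T}_j} \lambda_j(T) \le \sum_{i \ge 1} \sum_{t \in T_{j,i}} e^{-i} |T_{j,i}|^{-1} = \sum_{i \ge 1} e^{-i} < 1. \]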
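The construction of a maximal separated subset and the entropy bound that goes with it can be tried out on a finite cloud of points. The following sketch is illustrative only: the greedy procedure, the function name `maximal_separated_subset`, the dimension $k = 2$ and the values of $M$ and $\eta$ are assumptions, and maximality is understood within the finite cloud rather than within a continuous parameter set.

```python
import math
import random

def maximal_separated_subset(points, eta):
    """Greedy construction: keep a point if it is at distance > eta from all kept ones.
    The result is eta-separated and maximal within `points`."""
    kept = []
    for p in points:
        if all(math.dist(p, q) > eta for q in kept):
            kept.append(p)
    return kept

rng = random.Random(2)
k, M, eta = 2, 1.5, 0.2
points = []
while len(points) < 2000:
    # Rejection sampling of points in the Euclidean ball of radius M.
    p = tuple(rng.uniform(-M, M) for _ in range(k))
    if math.hypot(*p) <= M:
        points.append(p)

net = maximal_separated_subset(points, eta)
# Maximality implies that every point of the cloud is within eta of the subset.
covering_radius = max(min(math.dist(p, q) for q in net) for p in points)
entropy_bound = k * math.log(1.0 + 2.0 * M / eta)
```

Maximality is what turns the separated subset into a net, and the volume argument behind the entropy bound limits its cardinality to $(1 + 2M/\eta)^k$, so that `math.log(len(net))` cannot exceed `entropy_bound`.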

Since for all j and T G Tj, V{T) = 0, we get the first risk bound by applying 
Theorem 2 to the corresponding set 6. To prove (4.24), let us set = V 1 
for j = 1, . . . ,1 with Cj given by (4.25), so that 1 < < £j V 1 and notice that, if 

z>CjV 1, then ZL^e-^"^^ < zt (l + kjjij^^ . If Cj > 1, then Cj < + 1 < 2Cj, 
hence 

l^y2a,m+i) < (^(j-) + 1)^ (^1 + k,(3~') < 2C,T (l + kjf3r^) 

and 

/w2/e-^(^)) < /L2e-2"^*(J) < 2e2"^£jT (l + kj^r^^ < 2e^ CjT (\ + %/3ri) . 
Otherwise, Cj < 1, i(j) = 1 > £j V 1 and 



so that in both cases /w2^.(e-^(j)) < 2e'^{Cj V 1)t (^1 + /cj/S^i j , which leads to the 
conclusion. [] 



4.6.1 Estimating a density by a mixture of Gaussian densities 

In this section, we consider the problem of estimating a bounded density $s$ with
respect to some probability $\mu$ (to be specified later) on $E = \mathbb{R}^k$, $d$
denoting, as before, the $\mathbb{L}_2$-distance on $\mathbb{L}_2(E,\mu)$. We recall from
Section 2.3 that Theorem 1 applies to this situation with
$\tau = n^{-1} \|s\|_\infty (1 \vee \log \|s\|_\infty)$. A common way of modeling a density
on $E = \mathbb{R}^k$ is to assume that it is a mixture of Gaussian densities (or close
enough to it). More precisely, we wish to approximate $s$ by functions $\bar s$ of the form

\[ \bar s(x) = \sum_{j=1}^{l} q_j\, p(m_j, \Sigma_j, x) \quad \text{for all } x \in \mathbb{R}^k \tag{4.26} \]

where $l \ge 1$, $q = (q_1, \ldots, q_l) \in [0,1]^l$ satisfies $\sum_{j=1}^{l} q_j = 1$ and for
$j = 1, \ldots, l$, $p(m_j, \Sigma_j, .) = d\mathcal{N}(m_j, \Sigma_j^2)/d\mu$ denotes the
density (with respect to $\mu$) of the Gaussian distribution
$\mathcal{N}(m_j, \Sigma_j^2)$ centered at $m_j \in \mathbb{R}^k$ with covariance matrix
$\Sigma_j^2$ for some symmetric positive definite matrix $\Sigma_j$. Throughout this
section, we shall restrict to means $m_j$ with Euclidean norms not larger than some
positive number $r$ and to matrices $\Sigma_j$ with eigenvalues $\rho$ satisfying
$\underline{\rho} \le \rho \le \overline{\rho}$ for positive numbers
$\underline{\rho} < \overline{\rho}$. In order to parametrize the corresponding densities,
we introduce the set $\Theta$ gathering the elements $\theta$ of the form
$\theta = (m, \Sigma)$ where $\Sigma$ is a positive symmetric matrix with eigenvalues in
$[\underline{\rho}, \overline{\rho}]$ and $m \in B_k(0, r)$. We shall consider $\Theta$ as a
subset of $\mathbb{R}^{k(k+1)}$ endowed with the Euclidean distance. In particular, the set
$\mathcal{M}_k$ of square $k \times k$ matrices is identified to $\mathbb{R}^{k^2}$ and
endowed with the Euclidean distance and the corresponding norm $N$ defined by

\[ N^2(A) = \sum_{i=1}^{k} \sum_{j=1}^{k} A_{i,j}^2 \quad \text{if } A = (A_{i,j})_{1 \le i,j \le k}. \]

This norm derives from the inner product $[A, B] = \mathrm{tr}(AB^*)$ (where $B^*$ denotes the
transpose of $B$) on $\mathcal{M}_k$ and satisfies $N(AB) \le N(A)N(B)$ (by Cauchy-Schwarz
inequality) and $N(A) = N(UAU^{-1})$ for all orthogonal matrices $U$. In particular, if $A$
is symmetric and positive with eigenvalues bounded from above by $c$,
$N(A) \le \sqrt{k}\, c$. We shall use these properties later on. For
$b = r^2/(2\overline{\rho}^2) + k \log(\sqrt{2}\, \overline{\rho}/\underline{\rho})$ and $\mu$ the
Gaussian distribution $\mathcal{N}(0, 2\overline{\rho}^2 I_k)$ on $\mathbb{R}^k$ (where $I_k$
denotes the identity matrix) we define the parametric set $T$ by

\[ T = \left\{ u(\theta, .) = e^{-b/2} \sqrt{p(\theta, .)},\ \theta \in \Theta \right\}. \]



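For parameters $\theta_1 = (m_1, \Sigma_1), \ldots, \theta_l = (m_l, \Sigma_l)$ in $\Theta$, the
density $\bar s$ can be viewed as a composite function $g \circ u$ with

\[ g(y_1, \ldots, y_l) = e^{b} q_1 y_1^2 + \ldots + e^{b} q_l y_l^2 \tag{4.27} \]

and $u = (u_1, \ldots, u_l)$ with $u_j(.) = u(\theta_j, .)$ for $j = 1, \ldots, l$. With our
choices of $b$ and $\mu$, $u(\theta, .) \in \mathcal{T}$ for all
$\theta = (m, \Sigma) \in \Theta$ as required, since for all $x \in E$

\[ p(\theta, x) = \frac{(2\overline{\rho}^2)^{k/2}}{\det \Sigma} \exp\left[ \frac{|x|^2}{4\overline{\rho}^2} - \frac{|\Sigma^{-1}(x - m)|^2}{2} \right] \]
\[ \le \left(2\overline{\rho}^2 \underline{\rho}^{-2}\right)^{k/2} \exp\left[ \frac{|x - m|^2}{2\overline{\rho}^2} + \frac{|m|^2}{2\overline{\rho}^2} - \frac{|\Sigma^{-1}(x - m)|^2}{2} \right] \]
\[ \le \left(2\overline{\rho}^2 \underline{\rho}^{-2}\right)^{k/2} e^{r^2/(2\overline{\rho}^2)} \le e^{b}. \]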
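The bound $p(\theta, x) \le e^{b}$ can be checked on a grid in dimension $k = 1$. The sketch below is a plain numerical sanity check under assumptions: the values of $r$, $\underline{\rho}$, $\overline{\rho}$, the grid and the function name `density_ratio` are hypothetical. In dimension one, $\Sigma$ reduces to a standard deviation $\sigma \in [\underline{\rho}, \overline{\rho}]$ and $\mu$ to $\mathcal{N}(0, 2\overline{\rho}^2)$.

```python
import math

def density_ratio(m, sigma, x, rho_bar):
    """p(theta, x): density of N(m, sigma^2) with respect to mu = N(0, 2 rho_bar^2), k = 1."""
    return (math.sqrt(2.0) * rho_bar / sigma) * math.exp(
        x * x / (4.0 * rho_bar ** 2) - (x - m) ** 2 / (2.0 * sigma ** 2))

# Hypothetical values for the radius r and the eigenvalue bounds.
r, rho_low, rho_bar = 1.0, 0.5, 2.0
b = r ** 2 / (2.0 * rho_bar ** 2) + math.log(math.sqrt(2.0) * rho_bar / rho_low)

worst = 0.0
for m in (-r, -0.3, 0.0, 0.7, r):
    for sigma in (rho_low, 1.0, rho_bar):
        for step in range(-4000, 4001):
            x = step / 100.0
            worst = max(worst, density_ratio(m, sigma, x, rho_bar))
```

Choosing the dominating measure with variance $2\overline{\rho}^2$ is what makes the ratio bounded: its tails are heavier than those of every Gaussian density in the family.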



An application of Theorem 8 leads to the following result. 
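Corollary 4 Let $s$ be a bounded density in $\mathbb{L}_2(E,\mu)$, $d(., .)$ be the
$\mathbb{L}_2$-distance, $\tau = n^{-1} \|s\|_\infty (1 \vee \log \|s\|_\infty)$,
$M = \sqrt{k}\, \overline{\rho} + r$,
$b = r^2/(2\overline{\rho}^2) + k \log(\sqrt{2}\, \overline{\rho}/\underline{\rho})$,
$R = \sqrt{k/2}\, e^{-b/2} \underline{\rho}^{-1}$
and 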

, 1, / Ale^^T-^ 

There exists an estimator $\hat s$ satisfying for some universal constant $C > 0$

\[ C\,\mathbb{E}_s\left[d^2(s,\hat s)\right] \le \inf\left[d^2(s, g \circ u)\right] + lk(k+1)\tau \left[ \log(1 + 2MR) + (\mathcal{L}(\tau) \vee 1) \right], \tag{4.28} \]

where the infimum runs among all functions $u = (u_1, \ldots, u_l) \in T^l$ and $g$ of the
form (4.27).

The second term in the right-hand side of (4.28) does not depend on $g$ nor $u$ and is
of order $-\tau \log \tau$ as $\tau$ tends to 0. As already mentioned, one can also consider
many values of $l$ simultaneously and find the best one by using Theorem 3. Up to a
possibly different constant $C$, the risk of the resulting estimator then satisfies (4.28)
for all $l \ge 1$ simultaneously. The problem of estimating the parameters involved in a
mixture of Gaussian densities in $\mathbb{R}^k$ has also been considered by Maugis and
Michel (2011). Their approach is based on model selection among a family of parametric
models consisting of densities of the form (4.26). Nevertheless, they restrict to Gaussian
densities with specific forms of covariance matrices only.

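Proof of Corollary 4: First note that for all $\theta \in \Theta$,
$|\theta| \le |m| + N(\Sigma) \le r + \sqrt{k}\, \overline{\rho}$. Hence, if we can prove
that for all $\theta_0 = (m_0, \Sigma_0)$, $\theta_1 = (m_1, \Sigma_1)$ in $\Theta$

\[ d\left(u(\theta_0, .), u(\theta_1, .)\right) \le \sqrt{k/2}\, e^{-b/2} \underline{\rho}^{-1}\, |\theta_0 - \theta_1|, \tag{4.29} \]

Assumption 3 will be satisfied with

\[ M_j = M = r + \sqrt{k}\, \overline{\rho} \quad \text{and} \quad R_j = \sqrt{k/2}\, e^{-b/2} \underline{\rho}^{-1} = R \quad \text{for } j = 1, \ldots, l. \]

We shall therefore be able to apply Theorem 8 with $T_j = T$ for all $j$,
$\tau = n^{-1} \|s\|_\infty (1 \vee \log \|s\|_\infty)$, $\mathbb{F} = \{F\}$ where $F$ is the
linear span of dimension $\mathcal{D}(F) = l$ of functions $g$ of the form (4.27) and
$\gamma$ the Dirac mass at $F$. Since the functions $g$ of the form (4.27) are
$L$-Lipschitz with $L_j = 2 q_j e^{b} \le 2 e^{b}$ for all $j$, we shall finally deduce (4.28)
from (4.24).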
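We therefore only have to prove (4.29). Let us first note that

\[ d^2\left(u(\theta_0, .), u(\theta_1, .)\right) = 2 e^{-b} h^2\left( \mathcal{N}(m_0, \Sigma_0^2), \mathcal{N}(m_1, \Sigma_1^2) \right), \tag{4.30} \]

where $h$ denotes the Hellinger distance defined by (1.1). Some classical calculations
show that

\[ h^2\left( \mathcal{N}(m_0, \Sigma_0^2), \mathcal{N}(m_1, \Sigma_1^2) \right) = 1 - \frac{\exp\left[ -\frac{1}{4} \left\langle m_1 - m_0, (\Sigma_0^2 + \Sigma_1^2)^{-1} (m_1 - m_0) \right\rangle \right]}{\sqrt{\det\left( \left[ \Sigma_0^{-1}\Sigma_1 + \Sigma_0 \Sigma_1^{-1} \right]/2 \right)}} \]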

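and from the inequalities $1 - e^{-z} \le z$ and $\log(\det A) \le \mathrm{tr}(A - I_k)$ which
hold for all $z \in \mathbb{R}$ and all matrices $A$ such that $\det A > 0$, by setting
$\Sigma^2 = \Sigma_0^2 + \Sigma_1^2$ we deduce that

\[ 4 h^2\left( \mathcal{N}(m_0, \Sigma_0^2), \mathcal{N}(m_1, \Sigma_1^2) \right) \le 2 \log \det\left( \frac{\Sigma_0^{-1}\Sigma_1 + \Sigma_0 \Sigma_1^{-1}}{2} \right) + \left\langle m_1 - m_0, \Sigma^{-2}(m_1 - m_0) \right\rangle \]
\[ \le \mathrm{tr}\left( \Sigma_0^{-1}\Sigma_1 + \Sigma_0 \Sigma_1^{-1} - 2 I_k \right) + \left\langle m_1 - m_0, \Sigma^{-2}(m_1 - m_0) \right\rangle \]
\[ = \mathrm{tr}\left( (\Sigma_0 - \Sigma_1)\Sigma_0^{-1}(\Sigma_0 - \Sigma_1)\Sigma_1^{-1} \right) + \left\langle m_1 - m_0, \Sigma^{-2}(m_1 - m_0) \right\rangle = U_1 + U_2, \]

with

\[ U_1 = \mathrm{tr}\left( (\Sigma_0 - \Sigma_1)\Sigma_0^{-1}(\Sigma_0 - \Sigma_1)\Sigma_1^{-1} \right) \quad \text{and} \quad U_2 = \left\langle m_1 - m_0, \Sigma^{-2}(m_1 - m_0) \right\rangle. \]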

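It remains to bound $U_1$ and $U_2$ from above. For $U_1$, taking
$A = (\Sigma_0 - \Sigma_1)\Sigma_0^{-1}$ and $B = \Sigma_1^{-1}(\Sigma_0 - \Sigma_1)$ and using
the fact that the eigenvalues of $\Sigma_0^{-1}$ and $\Sigma_1^{-1}$ are not larger than
$\underline{\rho}^{-1}$, we get

\[ U_1 = [A, B] \le N(A)N(B) = N\left((\Sigma_0 - \Sigma_1)\Sigma_0^{-1}\right) N\left(\Sigma_1^{-1}(\Sigma_0 - \Sigma_1)\right) \]
\[ \le N\left(\Sigma_0^{-1}\right) N\left(\Sigma_1^{-1}\right) N^2(\Sigma_0 - \Sigma_1) \le \frac{k N^2(\Sigma_0 - \Sigma_1)}{\underline{\rho}^2}. \]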

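Let us now turn to $U_2$. It follows from the same arguments that the symmetric matrix
$\Sigma^2 = \Sigma_0^2 + \Sigma_1^2$ satisfies for all $x \in \mathbb{R}^k$

\[ \langle \Sigma^2 x, x\rangle = |\Sigma_0 x|^2 + |\Sigma_1 x|^2 \ge 2\underline{\rho}^2 |x|^2, \]

hence

\[ U_2 = \left\langle m_1 - m_0, \Sigma^{-2}(m_1 - m_0) \right\rangle \le \frac{|m_0 - m_1|^2}{2\underline{\rho}^2}. \]

Putting these bounds together, we obtain that

\[ 4 h^2\left( \mathcal{N}(m_0, \Sigma_0^2), \mathcal{N}(m_1, \Sigma_1^2) \right) \le \frac{k}{\underline{\rho}^2} \left( N^2(\Sigma_0 - \Sigma_1) + |m_0 - m_1|^2 \right) = \frac{k}{\underline{\rho}^2} |\theta_0 - \theta_1|^2, \]

which, together with (4.30), leads to (4.29). □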
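The Hellinger bound obtained at the end of this proof can be probed numerically in dimension $k = 1$, where the affinity between two Gaussian distributions has a simple closed form. The sketch below is a sanity check under assumptions: the parameter ranges, the random sampling and the function name `hellinger_sq` are hypothetical, and the convention $h^2 = \frac{1}{2}\int(\sqrt{f} - \sqrt{g})^2$ is assumed for the Hellinger distance. With $k = 1$, $\Sigma_i$ reduces to a standard deviation and $|\theta_0 - \theta_1|^2 = (m_0 - m_1)^2 + (\Sigma_0 - \Sigma_1)^2$.

```python
import math
import random

def hellinger_sq(m0, s0, m1, s1):
    """h^2 between N(m0, s0^2) and N(m1, s1^2) on the real line,
    with the convention h^2 = (1/2) * integral of (sqrt(f) - sqrt(g))^2."""
    affinity = math.sqrt(2.0 * s0 * s1 / (s0 ** 2 + s1 ** 2)) * math.exp(
        -(m1 - m0) ** 2 / (4.0 * (s0 ** 2 + s1 ** 2)))
    return 1.0 - affinity

rng = random.Random(5)
rho_low, rho_bar, r = 0.5, 2.0, 1.0   # hypothetical bounds on the parameters
worst_ratio = 0.0
for _ in range(10_000):
    m0, m1 = rng.uniform(-r, r), rng.uniform(-r, r)
    s0, s1 = rng.uniform(rho_low, rho_bar), rng.uniform(rho_low, rho_bar)
    dist_sq = (m0 - m1) ** 2 + (s0 - s1) ** 2   # |theta_0 - theta_1|^2 with k = 1
    if dist_sq > 1e-8:                           # skip nearly identical parameters
        ratio = 4.0 * hellinger_sq(m0, s0, m1, s1) * rho_low ** 2 / dist_sq
        worst_ratio = max(worst_ratio, ratio)
```

The quantity `worst_ratio` compares $4h^2$ with $(k/\underline{\rho}^2)|\theta_0 - \theta_1|^2$ for $k = 1$; the inequality of the proof says it should not exceed 1.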

5 Proofs of the main results 

Let us recall that, in this section, $d$ denotes the distance associated to the
$\|\cdot\|_q$ norm of $\mathbb{L}_q(E,\mu)$ and $d_\infty$ the distance associated to the
supnorm on $\mathcal{F}_{l,\infty}$.

5.1 Preliminary approximation results 

The purpose of this section is to see how well $f \circ t$ approximates $g \circ u$ when we
know how well $f$ approximates $g$ and $t = (t_1, \ldots, t_l)$ approximates $u$.

Proposition 4 Let $p \ge 1$, $g \in \mathcal{F}_{l,c}$, $f \in \mathcal{F}_{l,\infty}$ and
$t, u \in \mathcal{T}^l$. If $w_g$ is a modulus of continuity for $g$, then

\[ \|g \circ u - f \circ t\|_p \le d_\infty(g, f) + 2^{1/p} \sum_{j=1}^{l} w_{g,j}\left( \|u_j - t_j\|_p \right) \]

with the convention $2^{1/\infty} = 1$.
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Proof: It relies on the following lemma the proof of which is postponed to the end of 
the section. 

Lemma 3 Let (E, £, fi) be some probability space and w some nondecreasing and 
nonnegative concave function on M+ such that w{0) = 0. For all p € [1, +oo] and 
h G Lp(//), 

\\wi\h\)\\,<2^/pwm\p), 

with the convention 2^^°° = 1. 

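We argue as follows. For all $y, y' \in [-1,1]^l$, $|g(y) - g(y')| \le \sum_{j=1}^{l} w_{g,j}(|y_j - y'_j|)$ and, since $\mu$ is a probability on $E$,

$$\begin{aligned}
\|g\circ u - f\circ t\|_p &\le \|g\circ u - g\circ t\|_p + \|g\circ t - f\circ t\|_p \\
&\le \Big\|\sum_{j=1}^{l} w_{g,j}(|u_j - t_j|)\Big\|_p + \|g\circ t - f\circ t\|_\infty \\
&\le \sum_{j=1}^{l} \big\|w_{g,j}(|u_j - t_j|)\big\|_p + \sup_{y\in[-1,1]^l} |g(y) - f(y)| \\
&\le 2^{1/p}\sum_{j=1}^{l} w_{g,j}\big(\|u_j - t_j\|_p\big) + d_\infty(g,f),
\end{aligned}$$

which proves the proposition. □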
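As a sanity check, the inequality of Proposition 4 can be tested numerically on a finite probability space. The sketch below is only an illustration under assumptions that are not taken from the paper: the functions `g` and `f`, the moduli in `moduli` and all helper names are hypothetical choices, with $l = 2$, $w_{g,1}(z) = \sqrt{z}$, $w_{g,2}(z) = z/2$ and $d_\infty(g,f) = 0.1$.

```python
import math
import random

def lp_norm(values, weights, p):
    """L_p(mu) norm on a finite probability space (all weights positive)."""
    if math.isinf(p):
        return max(abs(v) for v in values)
    return sum(m * abs(v) ** p for v, m in zip(values, weights)) ** (1.0 / p)

def g(y):
    # Hypothetical g, continuous on [-1, 1]^2, with moduli sqrt(z) and z / 2.
    return math.sqrt(abs(y[0])) + 0.5 * y[1]

def f(y):
    # Hypothetical bounded f whose sup-distance to g equals 0.1.
    return g(y) + 0.1 * math.cos(5.0 * y[0] * y[1])

moduli = (math.sqrt, lambda z: 0.5 * z)   # w_{g,1}, w_{g,2}
d_inf = 0.1                               # d_infty(g, f) for the choices above

def prop4_sides(u, t, weights, p):
    """Return both sides of the inequality of Proposition 4."""
    lhs = lp_norm([g(a) - f(b) for a, b in zip(u, t)], weights, p)
    factor = 1.0 if math.isinf(p) else 2.0 ** (1.0 / p)
    rhs = d_inf + factor * sum(
        w(lp_norm([a[j] - b[j] for a, b in zip(u, t)], weights, p))
        for j, w in enumerate(moduli)
    )
    return lhs, rhs

random.seed(0)
n = 40
weights = [1.0 / n] * n                   # uniform probability on n atoms
u = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
# t is a noisy copy of u, clipped so that it still takes values in [-1, 1]^2.
t = [tuple(max(-1.0, min(1.0, c + random.gauss(0.0, 0.2))) for c in a) for a in u]

for p in (1.0, 2.0, math.inf):
    lhs, rhs = prop4_sides(u, t, weights, p)
    print(f"p={p}: lhs={lhs:.4f}, rhs={rhs:.4f}")
```

Such a check does not replace the proof; it only confirms that the reconstructed inequality behaves as expected on one example.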

Proof of Lemma 3: Since there is nothing to prove if $\|h\|_p = 0$, we shall assume that $\|h\|_p > 0$. The assumptions on $w$ imply that, for all $0 < a < b$, $b^{-1}w(b) \le a^{-1}w(a)$ and $w(a) \le w(b)$. Consequently, for $p \in [1, +\infty[$ and $b > 0$,

$$\begin{aligned}
\int_E w^p(|h|)\,d\mu &= \int_E w^p(|h|)\,1_{|h|\le b}\,d\mu + \int_E w^p(|h|)\,1_{|h|>b}\,d\mu \\
&\le w^p(b) + \frac{w^p(b)}{b^p}\int_E |h|^p\,1_{|h|>b}\,d\mu \le w^p(b) + \frac{w^p(b)}{b^p}\int_E |h|^p\,d\mu,
\end{aligned}$$

and the result follows by choosing $b = \|h\|_p$. The case $p = \infty$ can be deduced by letting $p$ go to $+\infty$. □
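The bound of Lemma 3 is easy to probe numerically. The following sketch assumes a finite probability space with randomly drawn weights and uses the illustrative modulus $w(x) = \sqrt{x} \wedge 1$ (concave, nondecreasing, with $w(0) = 0$); the function names are hypothetical and chosen only for this example.

```python
import math
import random

def lp_norm(values, weights, p):
    """L_p(mu) norm of a function on a finite probability space."""
    if math.isinf(p):
        return max(abs(v) for v, m in zip(values, weights) if m > 0)
    return sum(m * abs(v) ** p for v, m in zip(values, weights)) ** (1.0 / p)

def lemma3_sides(w, h, weights, p):
    """Return (||w(|h|)||_p, 2^{1/p} * w(||h||_p)), with 2^{1/inf} = 1."""
    lhs = lp_norm([w(abs(v)) for v in h], weights, p)
    factor = 1.0 if math.isinf(p) else 2.0 ** (1.0 / p)
    return lhs, factor * w(lp_norm(h, weights, p))

def w(x):
    # Illustrative modulus: concave, nondecreasing, nonnegative, w(0) = 0.
    return min(math.sqrt(x), 1.0)

random.seed(0)
raw = [random.random() + 1e-3 for _ in range(50)]
weights = [r / sum(raw) for r in raw]              # probability measure on 50 atoms
h = [random.gauss(0.0, 3.0) for _ in range(50)]    # an arbitrary function h

for p in (1.0, 2.0, 7.5, math.inf):
    lhs, rhs = lemma3_sides(w, h, weights, p)
    print(f"p={p}: lhs={lhs:.4f}, rhs={rhs:.4f}")
```

For $p = \infty$ both sides coincide here, since $w$ is nondecreasing and the factor $2^{1/p}$ is replaced by 1.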



5.2 Basic theorem 

We shall first prove a general theorem of independent interest that applies to finite models $T$ for functions in $\mathcal{T}^l$ and is at the core of all our further developments.

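Theorem 9 Let $I$ be a countable set and $\nu$ a subprobability on $I$. Assume that, for each $\ell \in I$, we are given two countable families $\mathbb{T}_\ell$ and $\mathbb{F}_\ell$ of subsets of $\mathcal{T}^l$ and $\mathcal{F}_{l,\infty}$ respectively such that each element $T$ of $\mathbb{T}_\ell$ is finite and each $F \in \mathbb{F}_\ell$ is a linear subspace of dimension $\mathcal{D}(F) \ge 1$ of $\mathcal{F}_{l,\infty}$. Let $\lambda_\ell$ and $\gamma_\ell$ be subprobabilities on $\mathbb{T}_\ell$ and $\mathbb{F}_\ell$ respectively. One can design an estimator $\hat{s} = \hat{s}(X)$ satisfying, for all $\ell \in I$, all $u \in \mathcal{T}^l$ and $g \in \mathcal{F}_{l,c}$ with modulus of continuity $w_g$,

$$\begin{aligned}
C\,\mathbb{E}_s\big[d^2(s,\hat{s})\big] \le{}& \inf_{T\in\mathbb{T}_\ell}\Big\{ l \inf_{t\in T}\sum_{j=1}^{l} w_{g,j}^2\big(\|u_j - t_j\|_q\big) + \tau\big[\Delta_{\lambda_\ell}(T) + \log|T| + \Delta_\nu(\ell)\big]\Big\} \\
&+ d^2(s, g\circ u) + \inf_{F\in\mathbb{F}_\ell}\Big\{ d_\infty^2(g,F) + \tau\big[\mathcal{D}(F) + \Delta_{\gamma_\ell}(F)\big]\Big\}.
\end{aligned}$$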



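Proof: For each $t \in \bigcup_{T\in\mathbb{T}_\ell} T$ and $F \in \mathbb{F}_\ell$ we consider the set $F_t = \{f\circ t,\ f\in F\} \subset \mathbb{L}_q(E,\mu)$, which is a $\mathcal{D}(F)$-dimensional linear space. This leads to a new countable family of models $\mathbb{S}_\ell$ together with a subprobability $\pi_\ell$ on $\mathbb{S}_\ell$ given by

$$\mathbb{S}_\ell = \Big\{F_t,\ t\in\bigcup_{T\in\mathbb{T}_\ell} T,\ F\in\mathbb{F}_\ell\Big\} \quad\text{and}\quad \pi_\ell(F_t) = \gamma_\ell(F)\sup_{T\in\mathbb{T}_\ell,\,T\ni t}\frac{\lambda_\ell(T)}{|T|}. \tag{5.1}$$

We then set $\mathbb{S} = \bigcup_{\ell\in I}\mathbb{S}_\ell$ and $\pi(F_t) = \nu(\ell)\,\pi_\ell(F_t)$ for $F_t \in \mathbb{S}_\ell$. It follows that

$$\Delta_\pi(F_t) = \Delta_{\gamma_\ell}(F) + \inf_{T\in\mathbb{T}_\ell,\,T\ni t}\big[\Delta_{\lambda_\ell}(T) + \log(|T|)\big] + \Delta_\nu(\ell) \quad\text{for } F_t\in\mathbb{S}_\ell.$$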

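Applying Theorem 1 to $\mathbb{S}$ and $\pi$ leads to an estimator $\hat{s}$ satisfying, for each $\ell \in I$,

$$C\,\mathbb{E}_s\big[d^2(s,\hat{s})\big] \le \inf_{F\in\mathbb{F}_\ell,\ T\in\mathbb{T}_\ell,\ t\in T}\Big\{ d^2(s,F_t) + \tau\big[\mathcal{D}(F) + \Delta_{\gamma_\ell}(F) + \Delta_{\lambda_\ell}(T) + \log|T| + \Delta_\nu(\ell)\big]\Big\}.$$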

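We now use Proposition 4 which implies that, for each $f\circ t \in F_t$,

$$\begin{aligned}
d^2(s, f\circ t) &\le \big(\|s - g\circ u\|_q + \|g\circ u - f\circ t\|_q\big)^2 \\
&\le \Big(\|s - g\circ u\|_q + d_\infty(g,f) + 2^{1/q}\sum_{j=1}^{l} w_{g,j}\big(\|u_j - t_j\|_q\big)\Big)^2 \\
&\le 3\Big[d^2(s, g\circ u) + d_\infty^2(g,f) + 2^{2/q}\, l\sum_{j=1}^{l} w_{g,j}^2\big(\|u_j - t_j\|_q\big)\Big],
\end{aligned}$$

which gives the announced bound for some universal constant $C$ since $2^{1/q} \le 2$. The conclusion follows from a minimization over all possible choices for $f$ and $t$. □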

5.3 Building new models 

In order to use Theorem 9, which applies to finite sets $T$, starting from the models $T$ which satisfy Assumption 1, we need to derive new models from the original ones. Let us first observe that, since $u_j$ takes its values in $[-1,1]$ and $\mu$ is a probability on $E$, $d(0,u_j) \le 1$. It is consequently useless to try to approximate $u_j$ by elements of $\mathbb{L}_q(E,\mu)$ that do not belong to $\mathcal{B}(0,2)$ since $0$ always does better. We may therefore replace $T \subset \mathbb{L}_q(E,\mu)$ by $(T\cap\mathcal{B}(0,2))\cup\{0\}$, denoting again the resulting set, which remains a subset of some $\mathcal{D}(T)$-dimensional linear space, by $T$. Moreover, this modification can only decrease the value of $d(T,u_j)$. Since now $T \subset \mathcal{B}(0,2)$, we can use the discretization argument described by the following lemma.

Lemma 4 Let $T \subset \mathcal{B}(0,2)$ be either a singleton (in which case $\mathcal{D}(T) = 0$) or a subset of some $\mathcal{D}(T)$-dimensional linear subspace of $\mathbb{L}_q(E,\mu)$ with $\mathcal{D}(T) \ge 1$. For each $\eta \in (0,1]$, one can find a subset $T[\eta]$ of $\mathcal{T}$ with cardinality bounded by $(5/\eta)^{\mathcal{D}(T)}$ such that

$$\inf_{t\in T[\eta]} d(t,v) \le \inf_{t\in T} d(t,v) + [\eta\wedge\mathcal{D}(T)] \quad\text{for all } v\in\mathcal{T}. \tag{5.2}$$

Proof: If $\mathcal{D}(T) = 0$, then $T = \{t\}$, we set $T[\eta] = \{(-1\vee t)\wedge 1\}$ and the result is immediate since $v$ takes its values in $[-1,1]$. Otherwise, let $T'$ be a maximal subset of $T$ such that $d(t,t') > \eta$ for each pair $(t,t')$ of distinct points in $T'$. Then, for each $t \in T$ there exists $t' \in T'$ such that $d(t,t') \le \eta$ and it follows from Lemma 4 in Birgé (2006) that $|T'| \le (5/\eta)^{\mathcal{D}(T)}$. Now set $T[\eta] = \{(-1\vee t)\wedge 1,\ t\in T'\}$. Then (5.2) holds since $\mathcal{D}(T) \ge 1$. □
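The construction in the proof of Lemma 4 (extract a maximal $\eta$-separated subset, then clip to $[-1,1]$) can be sketched for a finite model. The code below is a minimal illustration under simplifying assumptions that are not in the paper: $E$ is a finite grid with uniform $\mu$, $q = 2$, the model is a finite sample from a 2-dimensional linear space, and the names `dist`, `separated_subset`, `clip` and `discretize` are hypothetical.

```python
import math
import random

def dist(s, t):
    """L_2(mu) distance for functions on a finite set E with uniform mu."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)) / len(s))

def separated_subset(T, eta):
    """Greedily extract a maximal subset of T with pairwise distances > eta."""
    net = []
    for t in T:
        if all(dist(t, c) > eta for c in net):
            net.append(t)
    return net

def clip(t):
    """The map t -> (-1 v t) ^ 1, applied pointwise."""
    return tuple(max(-1.0, min(1.0, a)) for a in t)

def discretize(T, eta):
    """T[eta] = {(-1 v t) ^ 1, t in T'} for a maximal eta-separated subset T'."""
    return [clip(t) for t in separated_subset(T, eta)]

random.seed(1)
grid = [-1.0 + 2.0 * i / 19 for i in range(20)]     # the finite set E
T = []
for _ in range(300):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    T.append(tuple(a + b * x for x in grid))        # affine functions, norm at most 2

eta = 0.5
net = discretize(T, eta)
v = tuple(math.sin(3.0 * x) for x in grid)          # a target with values in [-1, 1]
print(len(net), min(dist(t, v) for t in net), min(dist(t, v) for t in T))
```

The greedy pass yields a maximal separated subset because every rejected point lies within $\eta$ of a point already kept; clipping is 1-Lipschitz and leaves $v$ unchanged, which is why the approximation property (5.2) survives it.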

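We are now in a position to build discrete models for approximating the elements of $\mathcal{T}^l$. Given $j \in \{1,\ldots,l\}$, $T_j$ in $\mathbb{T}_j$ and some $i \in \mathbb{N}^*$, the previous lemma provides a set $T_j[e^{-i}]$ satisfying $|T_j[e^{-i}]| \le \exp[\mathcal{D}(T_j)(i + \log 5)]$. Moreover,

$$d\big(u_j, T_j[e^{-i}]\big) \le d(u_j, T_j) + [e^{-i}\wedge\mathcal{D}(T_j)] \quad\text{for all } u\in\mathcal{T}^l \text{ and } i\in\mathbb{N}^*. \tag{5.3}$$

We then define the family $\mathbb{T}$ of models by

$$\mathbb{T} = \Big\{T = \prod_{j=1}^{l} T_j[e^{-i_j}] \ \text{with } (i_j, T_j)\in\mathbb{N}^*\times\mathbb{T}_j \text{ for } j = 1,\ldots,l\Big\}. \tag{5.4}$$

Then each $T = T_1[e^{-i_1}]\times\cdots\times T_l[e^{-i_l}]$ in $\mathbb{T}$ is finite, with a cardinality satisfying

$$\log|T| \le \sum_{j=1}^{l}\mathcal{D}(T_j)(i_j + \log 5). \tag{5.5}$$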
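The cardinality bound (5.5) is simply the logarithm of the product of the bounds $(5/e^{-i_j})^{\mathcal{D}(T_j)}$ given by Lemma 4. A small sketch (the function name `log_card_bound` and the numerical values are hypothetical) makes the bookkeeping explicit:

```python
import math

def log_card_bound(dims, levels):
    """Right-hand side of (5.5): sum over j of D(T_j) * (i_j + log 5)."""
    return sum(D * (i + math.log(5.0)) for D, i in zip(dims, levels))

# Two coordinates with dimensions D(T_1) = 2, D(T_2) = 3 and levels i_1 = 1, i_2 = 4.
dims, levels = [2, 3], [1, 4]
bound = log_card_bound(dims, levels)

# The same quantity obtained from the per-coordinate bounds (5 / e^{-i})^D.
product = math.prod((5.0 * math.exp(i)) ** D for D, i in zip(dims, levels))
print(bound, math.log(product))
```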

5.4 Proof of Theorem 2 

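Starting from the families $\mathbb{T}_j$, $1 \le j \le l$, we build the set $\mathbb{T}$ given by (5.4) as indicated in the previous section and we apply Theorem 9 to $\mathbb{F}$ and $\mathbb{T}$. This requires defining a suitable subprobability $\lambda$ on $\mathbb{T}$, which can be done by setting, for each $T = T_1[e^{-i_1}]\times\cdots\times T_l[e^{-i_l}]$ in $\mathbb{T}$,

$$\lambda(T) = \prod_{j=1}^{l}\lambda_j(T_j)\exp[-i_j\mathcal{D}(T_j)] \quad\text{or}\quad \Delta_\lambda(T) = \sum_{j=1}^{l}\big[\Delta_{\lambda_j}(T_j) + i_j\mathcal{D}(T_j)\big].$$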

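Applying Theorem 9 to $\mathbb{F}$ and $\mathbb{T}$ with $I$ reduced to a single element and $\nu$ the Dirac measure and using (5.5) and (5.3) which implies that

$$\inf_{t\in T} w_{g,j}\big(\|u_j - t_j\|_q\big) \le w_{g,j}\big([e^{-i_j}\wedge\mathcal{D}(T_j)] + d(u_j,T_j)\big) \le w_{g,j}\big(e^{-i_j}\wedge\mathcal{D}(T_j)\big) + w_{g,j}\big(d(u_j,T_j)\big)$$

by the subadditivity property of the modulus of continuity $w_{g,j}$, we get the risk bound

$$\begin{aligned}
C\,\mathbb{E}_s\big[d^2(s,\hat{s})\big] \le{}& \inf\sum_{j=1}^{l}\Big\{ 2l\big[w_{g,j}^2\big(d(u_j,T_j)\big) + w_{g,j}^2\big(e^{-i_j}\wedge\mathcal{D}(T_j)\big)\big] + \tau\big[\Delta_{\lambda_j}(T_j) + (2i_j + \log 5)\mathcal{D}(T_j)\big]\Big\} \\
&+ d^2(s, g\circ u) + \inf_{F\in\mathbb{F}}\Big\{ d_\infty^2(g,F) + \tau\big[\mathcal{D}(F) + \Delta_\gamma(F)\big]\Big\},
\end{aligned}$$

where the first infimum runs among all $T_j \in \mathbb{T}_j$ and all $i_j \in \mathbb{N}^*$ for $j = 1,\ldots,l$. Setting $i_j = i(g,j,T_j)$ implies that $l\,w_{g,j}^2\big(e^{-i_j}\wedge\mathcal{D}(T_j)\big) \le \tau\, i_j\,\mathcal{D}(T_j)$, which proves (3.2). As to (3.6), it simply derives from the fact that, if $\mathcal{D}(T) \ge 1$, then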

i{g,j,T) < \{2a,r^\og{lL][TV{T)r')] < 
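To make the choice of the index $i_j$ concrete, the sketch below computes the smallest positive integer $i$ for which $l\,w^2(e^{-i}\wedge\mathcal{D}(T)) \le \tau\, i\,\mathcal{D}(T)$. Treating this smallest integer as $i(g,j,T)$ is an assumption made here for illustration, since the definition of $i(g,j,T)$ lies outside this section; the function name `level_index`, the Hölder-type modulus $w(z) = Lz^{\alpha}$ and the numerical values are likewise hypothetical.

```python
import math

def level_index(w, l, tau, D, max_level=10_000):
    """Smallest i >= 1 with l * w(min(e^{-i}, D))^2 <= tau * i * D.

    Assumption: this is used here as a stand-in for i(g, j, T), whose
    definition is not part of this section.
    """
    for i in range(1, max_level + 1):
        if l * w(min(math.exp(-i), D)) ** 2 <= tau * i * D:
            return i
    raise ValueError("no admissible level below max_level")

# Hypothetical Holder-type modulus w(z) = L * z^alpha with L = 10, alpha = 1/2.
L, alpha = 10.0, 0.5
w = lambda z: L * z ** alpha

l, tau, D = 3, 0.01, 2
i_star = level_index(w, l, tau, D)

# Elementary upper bound for this modulus when D >= 1: any i with
# i >= log(l * L^2 / (tau * D)) / (2 * alpha) satisfies the inequality.
upper = max(1, math.ceil(math.log(l * L ** 2 / (tau * D)) / (2.0 * alpha)))
print(i_star, upper)
```

The logarithmic upper bound follows from $l L^2 e^{-2\alpha i} \le \tau\mathcal{D}(T) \le \tau\, i\,\mathcal{D}(T)$; it is derived here for the illustrative modulus only and is not claimed to coincide with the constants of (3.6).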

5.5 Proof of Theorem 3 

It follows exactly the line of proof of Theorem 2 via Theorem 9, with an additional step in order to mix the different families of models corresponding to the various sets $G_\ell$. To each $\ell \in I$ corresponds a family of models $\mathbb{S}_\ell$ and a subprobability $\pi_\ell$ on $\mathbb{S}_\ell$ given by (5.1). We again apply Theorem 9 with $I$ and $\nu$ as given in Theorem 3.



aTilog {lL][TV{T)r^)\\ll = C,,T. 



5.6 Proof of Lemma 1 



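If $D = 1$, we get the bound $a + b$. When $a > b$, we can choose $D$ such that $(a/b)^{1/(\theta+1)} \le D < (a/b)^{1/(\theta+1)} + 1$, so that

$$aD^{-\theta} + bD \le a(a/b)^{-\theta/(\theta+1)} + b\big[(a/b)^{1/(\theta+1)} + 1\big] = b + 2a^{1/(\theta+1)}b^{\theta/(\theta+1)}$$

and the bound $b + \big[2a^{1/(\theta+1)}b^{\theta/(\theta+1)}\wedge a\big]$ follows. If $b \ge a$, the bound $2b$ holds, otherwise $b < a^{1/(\theta+1)}b^{\theta/(\theta+1)}$ and the conclusion follows.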
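The two candidate dimensions used in this proof ($D = 1$ and, when $a > b$, the integer $D$ just above $(a/b)^{1/(\theta+1)}$) can be compared numerically. The sketch below uses hypothetical helper names and arbitrary test values; it only illustrates the bound $b + [2a^{1/(\theta+1)}b^{\theta/(\theta+1)}\wedge a]$ derived above.

```python
import math

def tradeoff(a, b, theta, D):
    """The quantity a * D^(-theta) + b * D for an integer D >= 1."""
    return a * D ** (-theta) + b * D

def choose_dimension(a, b, theta):
    """Best of the two candidates of the proof: D = 1 and, when a > b,
    the integer D with (a/b)^(1/(theta+1)) <= D < (a/b)^(1/(theta+1)) + 1."""
    candidates = [1]
    if a > b:
        candidates.append(math.ceil((a / b) ** (1.0 / (theta + 1.0))))
    return min(candidates, key=lambda D: tradeoff(a, b, theta, D))

def lemma1_bound(a, b, theta):
    """b + min(2 * a^(1/(theta+1)) * b^(theta/(theta+1)), a)."""
    geo = a ** (1.0 / (theta + 1.0)) * b ** (theta / (theta + 1.0))
    return b + min(2.0 * geo, a)

cases = [(100.0, 1.0, 2.0), (1.0, 100.0, 2.0), (5.0, 3.0, 0.5),
         (1e6, 1e-3, 1.0), (2.0, 2.0, 3.0)]
for a, b, theta in cases:
    D = choose_dimension(a, b, theta)
    print(a, b, theta, D, tradeoff(a, b, theta, D), lemma1_bound(a, b, theta))
```

Taking the better of the two candidates is what produces the minimum inside the bound: $D = 1$ accounts for the term $a$, the larger $D$ for the term $2a^{1/(\theta+1)}b^{\theta/(\theta+1)}$.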



5.7 Proof of Proposition 3 

It suffices to show that for all $i \in \{1,\ldots,k\}$ and $x \in [-1,1]^k$, the map $g\circ u_x(t) = g\circ u(x_1,\ldots,x_{i-1},t,x_{i+1},\ldots,x_k)$ from $[-1,1]$ into $\mathbb{R}$ belongs to $\mathcal{H}^{\theta_i}([-1,1])$. If at least one of $\alpha$ and $\beta_i$ is not larger than 1, the result is clear. Otherwise both are larger than 1 and we can write $\beta_i = b_i + \beta_i'$ and $\alpha = a + \alpha'$ with $a, b_i \in \mathbb{N}^*$ and $\beta_i', \alpha' \in (0,1]$. Both functions $g$ and $u_x$ are $b_i\wedge a$ times differentiable and the derivatives $g^{(\ell)}\circ u_x$ and $u_x^{(\ell)}$ for $\ell = 0,\ldots,b_i\wedge a$ are Hölderian with smoothness $\rho = (\beta_i - b_i\wedge a)\wedge(\alpha - b_i\wedge a) \in (0,1]$. Since the derivative of order $b_i\wedge a$ of $g\circ u_x$ is a polynomial with respect to these functions, we derive (4.4) from the fact that the set $(\mathcal{H}^{\rho}([-1,1]), +, \cdot)$ is an algebra on $\mathbb{R}$.

We shall prove the second part of the proposition for the case $k = 1$ only since the general case can be proved by similar arguments. For $\rho > 0$, let $h_\rho \in \mathcal{H}^{\rho}([-1,1]) \setminus \bigcup_{\rho'>\rho}\mathcal{H}^{\rho'}([-1,1])$. Given $\alpha, \beta > 0$, we distinguish between the cases below and the reader can check that for each of these $g \in \mathcal{H}^{\alpha}([-1,1])$, $u \in \mathcal{H}^{\beta}([-1,1])$, $g\circ u \in \mathcal{H}^{\theta}([-1,1])$ with $\theta = \phi(\alpha,\beta)$ but $g\circ u \notin \mathcal{H}^{\theta'}([-1,1])$ whatever $\theta' > \theta$. If $\alpha, \beta \le 1$, take $g(x) = |x|^{\alpha}$ and $u(y) = |y|^{\beta}$ for all $x, y \in [-1,1]$; if $1 < \beta$ and $\alpha \le \beta$, take $g = h_\alpha$ and $u(y) = y$ for all $y \in [-1,1]$; finally, if $\alpha > 1$ and $\alpha > \beta$, take $g(x) = x$ for all $x \in [-1,1]$ and $u = h_\beta$.

5.8 Proof of Lemma 2 

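For all $\alpha > 0$, the map defined for $y$ in $(0,+\infty)$ by

$$y \mapsto \frac{1}{\phi(\alpha, 1/y)} = \begin{cases} y(\alpha\wedge 1)^{-1} & \text{if } y \ge (\alpha\vee 1)^{-1}, \\ \alpha^{-1} & \text{otherwise,} \end{cases}$$

is positive, piecewise linear and convex. Hence,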

and equality holds if and only if $\beta_i \le (\alpha\vee 1)$ for all $i$ or if, for all $i$, $\beta_i \ge (\alpha\vee 1)$. We conclude by using the fact that $\phi(\alpha,z) \le z(\alpha\wedge 1)$ for every positive number $z$ and that equality holds if and only if $z \le \alpha\vee 1$.
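The last inequality can be checked on a small grid. The sketch below assumes the definition $\phi(\alpha,z) = \alpha z$ if $\alpha\vee z \le 1$ and $\phi(\alpha,z) = \alpha\wedge z$ otherwise; this definition is not restated in this section, so it should be read as an assumption, chosen to be consistent with the equality case quoted above.

```python
import math

def phi(alpha, z):
    """Assumed definition: alpha * z if max(alpha, z) <= 1, else min(alpha, z)."""
    return alpha * z if max(alpha, z) <= 1.0 else min(alpha, z)

def upper(alpha, z):
    """The bound z * (alpha ^ 1) used at the end of the proof."""
    return z * min(alpha, 1.0)

for alpha in (0.3, 1.0, 2.5):
    for z in (0.2, 1.0, 1.7, 2.5, 4.0):
        equal = math.isclose(phi(alpha, z), upper(alpha, z))
        print(alpha, z, phi(alpha, z), upper(alpha, z), equal)
```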